Multiple problems with Graylog

1. Describe your incident:
The Graylog server was up and running and receiving logs, but not from certain hosts. All inputs show as not running, yet all inputs are receiving messages. All sidecar logs show an "x.509 certificate not valid" error. I was recently able to restart the sidecar on one of the VMs sending logs, but a few days after this, the Graylog web admin site began to become intermittently unavailable. If you reboot the server, it becomes available for a few days, then it times out when you try to access it.

2. Describe your environment:

  • OS Information:
    Sidecars are on Windows. Graylog server is on CentOS7.

  • Package Version:
    graylog-server-3.0.2-1.noarch
    elasticsearch-6.8.0-1.noarch
    mongodb-org-tools-4.0.9-1.el7.x86_64
    mongodb-org-server-4.0.9-1.el7.x86_64
    mongodb-org-mongos-4.0.9-1.el7.x86_64
    mongodb-org-shell-4.0.9-1.el7.x86_64

  • Service logs, configurations, and environment variables:
    Please advise more specifically what is needed here, including file locations of logs. I am new at this, so need some guidance on what is needed. Thank you in advance.

3. What steps have you already taken to try and solve the problem?
I restarted the sidecar service on one reporting VM, which got messages flowing into the existing sidecar and inputs, but shortly after that the Graylog admin page became unavailable. I have reinstalled sidecar and nxlog on another VM that is not sending logs, and that has had no effect on getting logs from that VM to the Graylog server. I have reviewed many posts on the community, but I am not sure where to begin.

4. How can the community help?
Please help figure out where to begin, since I think there are multiple issues.

Any help or guidance would be useful. Not averse to posting whatever is needed to help troubleshoot, so please help if you can. Not sure where to begin. Thanks.

By the way, if I do:
tail -f /var/log/graylog-server/server.log

Then I get:
ERROR [DecodingProcessor] Unable to decode raw message RawMessage{ … etc.
ERROR [DecodingProcessor] Error processing message RawMessage{ … etc.

Can anyone advise what that means?

Hello && Welcome @fffhurst

Please be patient: the community members here have full-time jobs and help on their own time, so if your post was overlooked, it was probably because of the lack of information.

Ok so I’m going to pick your post apart.

You definitely may have an issue with certificates. How and where did you create these certs for your inputs?
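
One way to see why a client rejects the certificate is to attempt a verified TLS handshake against the Graylog HTTP interface and print the reason. This is only a minimal sketch in Python, assuming the interface listens on port 9000 (the default named in server.conf). The function name check_tls is made up for illustration, and Python's verification may not match the sidecar's exactly.

```python
import socket
import ssl


def check_tls(host, port=9000, cafile=None, timeout=5.0):
    """Try a verified TLS handshake and describe the outcome in one line."""
    # cafile is optional: pass the CA bundle (PEM) that signed the Graylog
    # certificate if that CA is not in the system trust store.
    context = ssl.create_default_context(cafile=cafile)
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context.wrap_socket(sock, server_hostname=host) as tls:
                cert = tls.getpeercert()
                return "OK: certificate accepted, valid until " + str(cert.get("notAfter"))
    except ssl.SSLCertVerificationError as exc:
        # Typical reasons: expired, self-signed, or host name mismatch.
        return "certificate rejected: " + exc.verify_message
    except (ssl.SSLError, OSError) as exc:
        return "connection failed: " + str(exc)
```

For example, print(check_tls("graylog.example.org", 9000)) run from a machine that can reach the server, where graylog.example.org is a placeholder for whatever host name the sidecars are configured to use.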

If the inputs on Graylog do not show as running but you are able to see messages, your issue would start there. Confirming that all parts of your Graylog server are functioning correctly is the first step.

Seeing your full logs would help. I can’t fix a CAR if I don’t see it :thinking:. Logs give us the ability to solve these issues. This is probably a couple of issues: disk space, permissions, the Graylog configuration, etc.
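
For the disk space and permission part, something like the following could give a quick first look. It is a rough sketch, not an official tool: the two directories are the journal and log locations mentioned in this thread, the 10% free-space threshold is an arbitrary assumption, and check_path is a made-up helper name. Run it as the user the Graylog service runs as, otherwise the writability result is not meaningful.

```python
import os
import shutil


def check_path(path, min_free_fraction=0.10):
    """Return a list of human-readable problems found for one directory."""
    if not os.path.isdir(path):
        return [f"{path}: directory does not exist"]
    problems = []
    # Writing files into a directory needs both write and search permission.
    if not os.access(path, os.W_OK | os.X_OK):
        problems.append(f"{path}: not writable by the current user")
    usage = shutil.disk_usage(path)
    if usage.total and usage.free / usage.total < min_free_fraction:
        problems.append(f"{path}: only {usage.free / usage.total:.0%} of the disk is free")
    return problems


if __name__ == "__main__":
    for directory in ("/var/lib/graylog-server/journal", "/var/log/graylog-server"):
        for problem in check_path(directory):
            print(problem)
```

No output means neither check tripped for those two directories.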

Perhaps check here for anything that may pertain to this issue.

Your Graylog config file may be incorrect, so showing that would be a start. Please substitute any personal information, and use the markup shown in the text box when posting. If you’re unsure, please look here for further information.

Those DecodingProcessor errors would be from your input. Let’s say someone is trying to put a round peg in a square hole.
What you have here, my friend, is the incorrect INPUT for the type of log shipped. How you set up your environment would be good information to know.
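
To make the round-peg idea concrete: the log lines show codec=gelf, so that input expects GELF. Below is a rough sketch of a check you could run on a captured payload. It assumes the required fields from the GELF 1.1 specification (version, host, short_message) and the usual gzip, zlib, and chunk magic bytes. The name gelf_problems and both sample payloads are made up for illustration, and this is not how Graylog itself decodes messages.

```python
import gzip
import json
import zlib

REQUIRED_FIELDS = ("version", "host", "short_message")


def gelf_problems(payload: bytes):
    """Return a list of reasons why payload is not a plain GELF message."""
    if payload[:2] == b"\x1e\x0f":
        return ["chunked GELF datagram: reassemble the chunks before checking"]
    try:
        if payload[:2] == b"\x1f\x8b":      # gzip magic bytes
            payload = gzip.decompress(payload)
        elif payload[:1] == b"\x78":        # first byte of a typical zlib stream
            payload = zlib.decompress(payload)
        message = json.loads(payload.decode("utf-8"))
    except (OSError, EOFError, zlib.error, ValueError) as exc:
        return [f"not decodable as JSON: {exc}"]
    if not isinstance(message, dict):
        return ["top-level JSON value is not an object"]
    return [f"missing required field '{name}'"
            for name in REQUIRED_FIELDS if name not in message]


good = json.dumps({"version": "1.1", "host": "web01", "short_message": "hello"}).encode()
syslog_line = b"<134>Feb 11 12:56:43 web01 app: hello"

print(gelf_problems(good))         # []
print(gelf_problems(syslog_line))  # a single "not decodable as JSON" entry
```

If a sender ships plain syslog or raw text to a GELF input, every message would fail in roughly this way, which would fit the repeating errors.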

Thank you for your response. I am happy to be patient and appreciate any help that is offered. It will take me a few days to gather all the information.

One question. The link you posted says that smaller code snippets should be surrounded by three backticks:

This is code.

How does one insert a summary block?

For now, here is the edited server.conf file for a start:

############################
# GRAYLOG CONFIGURATION FILE
############################


# (deleted initial explanatory part of file)

# If you are running more than one instances of Graylog server you have to select one of these
# instances as master. The master will perform some periodical tasks that non-masters won't perform.
is_master = true

# The auto-generated node ID will be stored in this file and read after restarts. It is a good idea
# to use an absolute file path here if you are starting Graylog server from init scripts or similar.
node_id_file = /etc/graylog/server/node-id

# You MUST set a secret to secure/pepper the stored user passwords here. Use at least 64 characters.
# Generate one by using for example: pwgen -N 1 -s 96
password_secret = <removed secret>

# The default root user is named 'admin'
root_username = root

# You MUST specify a hash password for the root user (which you only need to initially set up the
# system and in case you lose connectivity to your authentication backend)
# This password cannot be changed using the API or via the web interface. If you need to change it,
# modify it in this file.
# Create one by using for example: echo -n yourpassword | shasum -a 256
# and put the resulting hash value into the following line
root_password_sha2 = <removed password>

# The email address of the root user.
# Default is empty
#root_email = ""

# The time zone setting of the root user. See http://www.joda.org/joda-time/timezones.html for a list of valid time zones.
# Default is UTC
#root_timezone = UTC

# Set the bin directory here (relative or absolute)
# This directory contains binaries that are used by the Graylog server.
# Default: bin
bin_dir = /usr/share/graylog-server/bin

# Set the data directory here (relative or absolute)
# This directory is used to store Graylog server state.
# Default: data
data_dir = /var/lib/graylog-server

# Set plugin directory here (relative or absolute)
plugin_dir = /usr/share/graylog-server/plugin

###############
# HTTP settings
###############

#### HTTP bind address
#
# The network interface used by the Graylog HTTP interface.
#
# This network interface must be accessible by all Graylog nodes in the cluster and by all clients
# using the Graylog web interface.
#
# If the port is omitted, Graylog will use port 9000 by default.
#
# Default: 127.0.0.1:9000
http_bind_address = <removed http_bind_address>

#### HTTP publish URI
#
# The HTTP URI of this Graylog node which is used to communicate with the other Graylog nodes in the cluster and by all
# clients using the Graylog web interface.
#
# The URI will be published in the cluster discovery APIs, so that other Graylog nodes will be able to find and connect to this Graylog node.
#
# This configuration setting has to be used if this Graylog node is available on another network interface than $http_bind_address,
# for example if the machine has multiple network interfaces or is behind a NAT gateway.
#
# If $http_bind_address contains a wildcard IPv4 address (0.0.0.0), the first non-loopback IPv4 address of this machine will be used.
# This configuration setting *must not* contain a wildcard address!
#
# Default: http://$http_bind_address/
http_publish_uri = <removed URI>

#### External Graylog URI
#
# The public URI of Graylog which will be used by the Graylog web interface to communicate with the Graylog REST API.
#
# The external Graylog URI usually has to be specified, if Graylog is running behind a reverse proxy or load-balancer
# and it will be used to generate URLs addressing entities in the Graylog REST API (see $http_bind_address).
#
# When using Graylog Collector, this URI will be used to receive heartbeat messages and must be accessible for all collectors.
#
# This setting can be overridden on a per-request basis with the "X-Graylog-Server-URL" HTTP request header.
#
# Default: $http_publish_uri
#http_external_uri =

#### Enable CORS headers for HTTP interface
#
# This is necessary for JS-clients accessing the server directly.
# If these are disabled, modern browsers will not be able to retrieve resources from the server.
# This is enabled by default. Uncomment the next line to disable it.
#http_enable_cors = false

#### Enable GZIP support for HTTP interface
#
# This compresses API responses and therefore helps to reduce
# overall round trip times. This is enabled by default. Uncomment the next line to disable it.
#http_enable_gzip = false

# The maximum size of the HTTP request headers in bytes.
#http_max_header_size = 8192

# The size of the thread pool used exclusively for serving the HTTP interface.
#http_thread_pool_size = 16

################
# HTTPS settings
################

#### Enable HTTPS support for the HTTP interface
#
# This secures the communication with the HTTP interface with TLS to prevent request forgery and eavesdropping.
#
# Default: false
http_enable_tls = true

# The X.509 certificate chain file in PEM format to use for securing the HTTP interface.
http_tls_cert_file = <removed cert file location>

# The PKCS#8 private key file in PEM format to use for securing the HTTP interface.
http_tls_key_file = <removed key location>

# The password to unlock the private key used for securing the HTTP interface.
http_tls_key_password = <removed password>

# Comma separated list of trusted proxies that are allowed to set the client address with X-Forwarded-For
# header. May be subnets, or hosts.
#trusted_proxies = 127.0.0.1/32, 0:0:0:0:0:0:0:1/128

# List of Elasticsearch hosts Graylog should connect to.
# Need to be specified as a comma-separated list of valid URIs for the http ports of your elasticsearch nodes.
# If one or more of your elasticsearch hosts require authentication, include the credentials in each node URI that
# requires authentication.
#
# Default: http://127.0.0.1:9200
#elasticsearch_hosts = http://node1:9200,http://user:password@node2:19200

# Maximum amount of time to wait for successful connection to Elasticsearch HTTP port.
#
# Default: 10 Seconds
#elasticsearch_connect_timeout = 10s

# Maximum amount of time to wait for reading back a response from an Elasticsearch server.
#
# Default: 60 seconds
#elasticsearch_socket_timeout = 60s

# Maximum idle time for an Elasticsearch connection. If this is exceeded, this connection will
# be torn down.
#
# Default: inf
#elasticsearch_idle_timeout = -1s

# Maximum number of total connections to Elasticsearch.
#
# Default: 20
#elasticsearch_max_total_connections = 20

# Maximum number of total connections per Elasticsearch route (normally this means per
# elasticsearch server).
#
# Default: 2
#elasticsearch_max_total_connections_per_route = 2

# Maximum number of times Graylog will retry failed requests to Elasticsearch.
#
# Default: 2
#elasticsearch_max_retries = 2

# Enable automatic Elasticsearch node discovery through Nodes Info,
# see https://www.elastic.co/guide/en/elasticsearch/reference/5.4/cluster-nodes-info.html
#
# WARNING: Automatic node discovery does not work if Elasticsearch requires authentication, e. g. with Shield.
#
# Default: false
#elasticsearch_discovery_enabled = true

# Filter for including/excluding Elasticsearch nodes in discovery according to their custom attributes,
# see https://www.elastic.co/guide/en/elasticsearch/reference/5.4/cluster.html#cluster-nodes
#
# Default: empty
#elasticsearch_discovery_filter = rack:42

# Frequency of the Elasticsearch node discovery.
#
# Default: 30s
# elasticsearch_discovery_frequency = 30s

# Enable payload compression for Elasticsearch requests.
#
# Default: false
#elasticsearch_compression_enabled = true

# Graylog will use multiple indices to store documents in. You can configure the strategy it uses to determine
# when to rotate the currently active write index.
# It supports multiple rotation strategies:
#   - "count" of messages per index, use elasticsearch_max_docs_per_index below to configure
#   - "size" per index, use elasticsearch_max_size_per_index below to configure
# valid values are "count", "size" and "time", default is "count"
#
# ATTENTION: These settings have been moved to the database in 2.0. When you upgrade, make sure to set these
#            to your previous 1.x settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
rotation_strategy = count

# (Approximate) maximum number of documents in an Elasticsearch index before a new index
# is being created, also see no_retention and elasticsearch_max_number_of_indices.
# Configure this if you used 'rotation_strategy = count' above.
#
# ATTENTION: These settings have been moved to the database in 2.0. When you upgrade, make sure to set these
#            to your previous 1.x settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
elasticsearch_max_docs_per_index = 20000000

# (Approximate) maximum size in bytes per Elasticsearch index on disk before a new index is being created, also see
# no_retention and elasticsearch_max_number_of_indices. Default is 1GB.
# Configure this if you used 'rotation_strategy = size' above.
#
# ATTENTION: These settings have been moved to the database in 2.0. When you upgrade, make sure to set these
#            to your previous 1.x settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
#elasticsearch_max_size_per_index = 1073741824

# (Approximate) maximum time before a new Elasticsearch index is being created, also see
# no_retention and elasticsearch_max_number_of_indices. Default is 1 day.
# Configure this if you used 'rotation_strategy = time' above.
# Please note that this rotation period does not look at the time specified in the received messages, but is
# using the real clock value to decide when to rotate the index!
# Specify the time using a duration and a suffix indicating which unit you want:
#  1w  = 1 week
#  1d  = 1 day
#  12h = 12 hours
# Permitted suffixes are: d for day, h for hour, m for minute, s for second.
#
# ATTENTION: These settings have been moved to the database in 2.0. When you upgrade, make sure to set these
#            to your previous 1.x settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
#elasticsearch_max_time_per_index = 1d

# Disable checking the version of Elasticsearch for being compatible with this Graylog release.
# WARNING: Using Graylog with unsupported and untested versions of Elasticsearch may lead to data loss!
#elasticsearch_disable_version_check = true

# Disable message retention on this node, i. e. disable Elasticsearch index rotation.
#no_retention = false

# How many indices do you want to keep?
#
# ATTENTION: These settings have been moved to the database in 2.0. When you upgrade, make sure to set these
#            to your previous 1.x settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
elasticsearch_max_number_of_indices = 20

# Decide what happens with the oldest indices when the maximum number of indices is reached.
# The following strategies are available:
#   - delete # Deletes the index completely (Default)
#   - close # Closes the index and hides it from the system. Can be re-opened later.
#
# ATTENTION: These settings have been moved to the database in 2.0. When you upgrade, make sure to set these
#            to your previous 1.x settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
retention_strategy = delete

# How many Elasticsearch shards and replicas should be used per index? Note that this only applies to newly created indices.
# ATTENTION: These settings have been moved to the database in Graylog 2.2.0. When you upgrade, make sure to set these
#            to your previous settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
elasticsearch_shards = 4
elasticsearch_replicas = 0

# Prefix for all Elasticsearch indices and index aliases managed by Graylog.
#
# ATTENTION: These settings have been moved to the database in Graylog 2.2.0. When you upgrade, make sure to set these
#            to your previous settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
elasticsearch_index_prefix = graylog

# Name of the Elasticsearch index template used by Graylog to apply the mandatory index mapping.
# Default: graylog-internal
#
# ATTENTION: These settings have been moved to the database in Graylog 2.2.0. When you upgrade, make sure to set these
#            to your previous settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
#elasticsearch_template_name = graylog-internal

# Do you want to allow searches with leading wildcards? This can be extremely resource hungry and should only
# be enabled with care. See also: http://docs.graylog.org/en/2.1/pages/queries.html
allow_leading_wildcard_searches = false

# Do you want to allow searches to be highlighted? Depending on the size of your messages this can be memory hungry and
# should only be enabled after making sure your Elasticsearch cluster has enough memory.
allow_highlighting = false

# Analyzer (tokenizer) to use for message and full_message field. The "standard" filter usually is a good idea.
# All supported analyzers are: standard, simple, whitespace, stop, keyword, pattern, language, snowball, custom
# Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/analysis.html
# Note that this setting only takes effect on newly created indices.
#
# ATTENTION: These settings have been moved to the database in Graylog 2.2.0. When you upgrade, make sure to set these
#            to your previous settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
elasticsearch_analyzer = standard

# Global request timeout for Elasticsearch requests (e. g. during search, index creation, or index time-range
# calculations) based on a best-effort to restrict the runtime of Elasticsearch operations.
# Default: 1m
#elasticsearch_request_timeout = 1m

# Global timeout for index optimization (force merge) requests.
# Default: 1h
#elasticsearch_index_optimization_timeout = 1h

# Maximum number of concurrently running index optimization (force merge) jobs.
# If you are using lots of different index sets, you might want to increase that number.
# Default: 20
#elasticsearch_index_optimization_jobs = 20

# Time interval for index range information cleanups. This setting defines how often stale index range information
# is being purged from the database.
# Default: 1h
#index_ranges_cleanup_interval = 1h

# Time interval for the job that runs index field type maintenance tasks like cleaning up stale entries. This doesn't
# need to run very often.
# Default: 1h
#index_field_type_periodical_interval = 1h

# Batch size for the Elasticsearch output. This is the maximum (!) number of messages the Elasticsearch output
# module will get at once and write to Elasticsearch in a batch call. If the configured batch size has not been
# reached within output_flush_interval seconds, everything that is available will be flushed at once. Remember
# that every outputbuffer processor manages its own batch and performs its own batch write calls.
# ("outputbuffer_processors" variable)
output_batch_size = 500

# Flush interval (in seconds) for the Elasticsearch output. This is the maximum amount of time between two
# batches of messages written to Elasticsearch. It is only effective at all if your minimum number of messages
# for this time period is less than output_batch_size * outputbuffer_processors.
output_flush_interval = 1

# As stream outputs are loaded only on demand, an output which is failing to initialize will be tried over and
# over again. To prevent this, the following configuration options define after how many faults an output will
# not be tried again for an also configurable amount of seconds.
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

# The number of parallel running processors.
# Raise this number if your buffers are filling up.
processbuffer_processors = 5
outputbuffer_processors = 3

# The following settings (outputbuffer_processor_*) configure the thread pools backing each output buffer processor.
# See https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html for technical details

# When the number of threads is greater than the core (see outputbuffer_processor_threads_core_pool_size),
# this is the maximum time in milliseconds that excess idle threads will wait for new tasks before terminating.
# Default: 5000
#outputbuffer_processor_keep_alive_time = 5000

# The number of threads to keep in the pool, even if they are idle, unless allowCoreThreadTimeOut is set
# Default: 3
#outputbuffer_processor_threads_core_pool_size = 3

# The maximum number of threads to allow in the pool
# Default: 30
#outputbuffer_processor_threads_max_pool_size = 30

# UDP receive buffer size for all message inputs (e. g. SyslogUDPInput).
#udp_recvbuffer_sizes = 1048576

# Wait strategy describing how buffer processors wait on a cursor sequence. (default: sleeping)
# Possible types:
#  - yielding
#     Compromise between performance and CPU usage.
#  - sleeping
#     Compromise between performance and CPU usage. Latency spikes can occur after quiet periods.
#  - blocking
#     High throughput, low latency, higher CPU usage.
#  - busy_spinning
#     Avoids syscalls which could introduce latency jitter. Best when threads can be bound to specific CPU cores.
processor_wait_strategy = blocking

# Size of internal ring buffers. Raise this if raising outputbuffer_processors does not help anymore.
# For optimum performance your LogMessage objects in the ring buffer should fit in your CPU L3 cache.
# Must be a power of 2. (512, 1024, 2048, ...)
ring_size = 65536

inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

# Enable the disk based message journal.
message_journal_enabled = true

# The directory which will be used to store the message journal. The directory must be exclusively used by Graylog and
# must not contain any other files than the ones created by Graylog itself.
#
# ATTENTION:
#   If you create a separate partition for the journal files and use a file system creating directories like 'lost+found'
#   in the root directory, you need to create a sub directory for your journal.
#   Otherwise Graylog will log an error message that the journal is corrupt and Graylog will not start.
message_journal_dir = /var/lib/graylog-server/journal

# Journal hold messages before they could be written to Elasticsearch.
# For a maximum of 12 hours or 5 GB whichever happens first.
# During normal operation the journal will be smaller.
#message_journal_max_age = 12h
#message_journal_max_size = 5gb

#message_journal_flush_age = 1m
#message_journal_flush_interval = 1000000
#message_journal_segment_age = 1h
#message_journal_segment_size = 100mb

# Number of threads used exclusively for dispatching internal events. Default is 2.
#async_eventbus_processors = 2

# How many seconds to wait between marking node as DEAD for possible load balancers and starting the actual
# shutdown process. Set to 0 if you have no status checking load balancers in front.
lb_recognition_period_seconds = 3

# Journal usage percentage that triggers requesting throttling for this server node from load balancers. The feature is
# disabled if not set.
#lb_throttle_threshold_percentage = 95

# Every message is matched against the configured streams and it can happen that a stream contains rules which
# take an unusual amount of time to run, for example if its using regular expressions that perform excessive backtracking.
# This will impact the processing of the entire server. To keep such misbehaving stream rules from impacting other
# streams, Graylog limits the execution time for each stream.
# The default values are noted below, the timeout is in milliseconds.
# If the stream matching for one stream took longer than the timeout value, and this happened more than "max_faults" times
# that stream is disabled and a notification is shown in the web interface.
#stream_processing_timeout = 2000
#stream_processing_max_faults = 3

# Length of the interval in seconds in which the alert conditions for all streams should be checked
# and alarms are being sent.
#alert_check_interval = 60

# Since 0.21 the Graylog server supports pluggable output modules. This means a single message can be written to multiple
# outputs. The next setting defines the timeout for a single output module, including the default output module where all
# messages end up.
#
# Time in milliseconds to wait for all message outputs to finish writing a single message.
#output_module_timeout = 10000

# Time in milliseconds after which a detected stale master node is being rechecked on startup.
#stale_master_timeout = 2000

# Time in milliseconds which Graylog is waiting for all threads to stop on shutdown.
#shutdown_timeout = 30000

# MongoDB connection string
# See https://docs.mongodb.com/manual/reference/connection-string/ for details
mongodb_uri = mongodb://mongo_admin:<removed connection string>

# Authenticate against the MongoDB server
# '+'-signs in the username or password need to be replaced by '%2B'
mongodb_uri = mongodb://mongo_admin:<removed>

# Use a replica set instead of a single host
mongodb_uri = mongodb://mongo_admin:<removed>

# Increase this value according to the maximum connections your MongoDB server can handle from a single client
# if you encounter MongoDB connection problems.
mongodb_max_connections = 1000

# Number of threads allowed to be blocked by MongoDB connections multiplier. Default: 5
# If mongodb_max_connections is 100, and mongodb_threads_allowed_to_block_multiplier is 5,
# then 500 threads can block. More than that and an exception will be thrown.
# http://api.mongodb.com/java/current/com/mongodb/MongoOptions.html#threadsAllowedToBlockForConnectionMultiplier
mongodb_threads_allowed_to_block_multiplier = 5


# Email transport
transport_email_enabled = true
transport_email_hostname = <removed>
transport_email_port = 25
transport_email_use_auth = false
#transport_email_auth_username = 
#transport_email_auth_password =
transport_email_subject_prefix = [graylog]
transport_email_from_email = <removed>

# Encryption settings
#
# ATTENTION:
#    Using SMTP with STARTTLS *and* SMTPS at the same time is *not* possible.

# Use SMTP with STARTTLS, see https://en.wikipedia.org/wiki/Opportunistic_TLS
transport_email_use_tls = false

# Use SMTP over SSL (SMTPS), see https://en.wikipedia.org/wiki/SMTPS
# This is deprecated on most SMTP services!
#transport_email_use_ssl = false


# Specify and uncomment this if you want to include links to the stream in your stream alert mails.
# This should define the fully qualified base url to your web interface exactly the same way as it is accessed by your users.
transport_email_web_interface_url = <removed>

# The default connect timeout for outgoing HTTP connections.
# Values must be a positive duration (and between 1 and 2147483647 when converted to milliseconds).
# Default: 5s
#http_connect_timeout = 5s

# The default read timeout for outgoing HTTP connections.
# Values must be a positive duration (and between 1 and 2147483647 when converted to milliseconds).
# Default: 10s
#http_read_timeout = 10s

# The default write timeout for outgoing HTTP connections.
# Values must be a positive duration (and between 1 and 2147483647 when converted to milliseconds).
# Default: 10s
#http_write_timeout = 10s

# HTTP proxy for outgoing HTTP connections
# ATTENTION: If you configure a proxy, make sure to also configure the "http_non_proxy_hosts" option so internal
#            HTTP connections with other nodes do not go through the proxy.
# Examples:
#   - http://proxy.example.com:8123
#   - http://username:password@proxy.example.com:8123
#http_proxy_uri =

# A list of hosts that should be reached directly, bypassing the configured proxy server.
# This is a list of patterns separated by ",". The patterns may start or end with a "*" for wildcards.
# Any host matching one of these patterns will be reached through a direct connection instead of through a proxy.
# Examples:
#   - localhost,127.0.0.1
#   - 10.0.*,*.example.com
#http_non_proxy_hosts =

# Disable the optimization of Elasticsearch indices after index cycling. This may take some load from Elasticsearch
# on heavily used systems with large indices, but it will decrease search performance. The default is to optimize
# cycled indices.
#
# ATTENTION: These settings have been moved to the database in Graylog 2.2.0. When you upgrade, make sure to set these
#            to your previous settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
#disable_index_optimization = true

# Optimize the index down to <= index_optimization_max_num_segments. A higher number may take some load from Elasticsearch
# on heavily used systems with large indices, but it will decrease search performance. The default is 1.
#
# ATTENTION: These settings have been moved to the database in Graylog 2.2.0. When you upgrade, make sure to set these
#            to your previous settings so they will be migrated to the database!
#            This configuration setting is only used on the first start of Graylog. After that,
#            index related settings can be changed in the Graylog web interface on the 'System / Indices' page.
#            Also see http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration.
#index_optimization_max_num_segments = 1

# The threshold of the garbage collection runs. If GC runs take longer than this threshold, a system notification
# will be generated to warn the administrator about possible problems with the system. Default is 1 second.
#gc_warning_threshold = 1s

# Connection timeout for a configured LDAP server (e. g. ActiveDirectory) in milliseconds.
#ldap_connection_timeout = 2000

# Disable the use of SIGAR for collecting system stats
#disable_sigar = false

# The default cache time for dashboard widgets. (Default: 10 seconds, minimum: 1 second)
#dashboard_widget_default_cache_time = 10s

# For some cluster-related REST requests, the node must query all other nodes in the cluster. This is the maximum number
# of threads available for this. Increase it, if '/cluster/*' requests take long to complete.
# Should be http_thread_pool_size * average_cluster_size if you have a high number of concurrent users.
proxied_requests_thread_pool_size = 32

Excerpt from the logs, which had 87,000 lines in 2 hours repeating this:

2022-02-11T12:56:43.791-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=f9091d35-8b63-11ec-bfcc-0050568bf8e7, journalOffset=5289188359, codec=gelf, payloadSize=326, timestamp=2022-02-11T17:56:43.779Z, remoteAddress=/x.x.x.72:49611} on input <5cf13a5629fbc65472f9e843>.
2022-02-11T12:56:43.804-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=f9091d35-8b63-11ec-bfcc-0050568bf8e7, journalOffset=5289188359, codec=gelf, payloadSize=326, timestamp=2022-02-11T17:56:43.779Z, remoteAddress=/x.x.x.72:49611}
java.lang.IllegalArgumentException: GELF message <f9091d35-8b63-11ec-bfcc-0050568bf8e7> (received from <x.x.x.72:49611>) has empty mandatory "short_message" field.
	at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
	at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:150) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:91) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2022-02-11T12:56:43.793-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=f9091d31-8b63-11ec-bfcc-0050568bf8e7, journalOffset=5289188355, codec=gelf, payloadSize=326, timestamp=2022-02-11T17:56:43.779Z, remoteAddress=/x.x.x.72:49611} on input <5cf13a5629fbc65472f9e843>.
 

Hello,

It seems you have the same problem as in this post.

As I stated earlier.

EDIT: A tip on posting configuration files. If you execute this command, it will show only the uncommented lines, which are the ones that matter and which make the file a lot easier to read.

cat /etc/graylog/server/server.conf | egrep -v "^\s*(#|$)"

The input is accepting logs from the vm that the error refers to, so any ideas on how it is both accepting logs and not accepting them at the same time?

The graylog admin page does not stay up long enough for me to actually copy and paste the log here, but the vm name is the same as the one belonging to the IP mentioned in the error above. How is that possible?

Hello @fffhurst

It seems that you have a few issues to deal with. The only way I know to handle that is to take one issue at a time and, once that is completed, repeat the process with the next one. I can only do so much from here. I do know you need to make Graylog stable first and move on from there.

If you’re seeing the same message ID from the error logs in your Web UI, it means Graylog is complaining that it had to fix the message, possibly because it is the wrong type for the input being used. If you know which device it is, perhaps try something like a Raw/Plaintext input for that device. If you have tried that already, look into what type of logs that device is sending, which would be my first choice in troubleshooting. Another suggestion, if you do know which device it is: use a specific port for that device and try different inputs on that port. If this is not a switch/firewall, look into how the logs are shipped; maybe there is a clue there that could help.
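
One way to narrow down which devices are sending the messages Graylog rejects is to count the "short_message" errors in the server log per sender address. The following is only a minimal sketch: it assumes the error lines look like the ones in the excerpt above and that senders are IPv4 addresses, and `count_bad_gelf_senders` is a made-up helper name, not a Graylog tool.

```shell
#!/bin/sh
# Count 'empty mandatory "short_message"' errors per sender address.
# count_bad_gelf_senders is a hypothetical helper name used for illustration.
count_bad_gelf_senders() {
  # $1: path to the Graylog server log
  grep 'has empty mandatory "short_message" field' "$1" \
    | sed -E 's/.*received from <([^:>]+):[0-9]+>.*/\1/' \
    | sort | uniq -c | sort -rn
}

# Usage (log path as used earlier in this thread):
#   count_bad_gelf_senders /var/log/graylog-server/server.log
```

Each output line is a count followed by an address, with the noisiest sender first. If a single host accounts for nearly all of the errors, that is probably the device to look at.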

This would be something like me asking you to fix my car when all I told you was that it won’t start. To be honest, I really don’t know why that is happening to you. We don’t have a clear picture of what’s going on in your environment. If you show us what you’re seeing, maybe we can solve this issue.

I understand. If you could let me know specifically what you would like to see that would be helpful.
My first order of business is to get the graylog web UI to remain working consistently. To that end, I noticed that we recently restarted sidecar on one of the nodes that was reporting logs, and that we have had problems with the graylog web UI since then. So, I stopped the graylog and nxlog services on that node in the hope that would help. I then restarted the graylog server. The next day I tried to log into the web UI. For the first 3 minutes after I logged in, it was working fine. Then after about 4 minutes the web UI went down again. The graylog server logs show a new error after the server restart that was not there before the restart.

Before server restart:

2022-02-15T12:05:12.261-05:00 ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=6ffdeb87-8e81-11ec-a171-0050568bf8e7, journalOffset=5308282893, codec=gelf, payloadSize=326, timestamp=2022-02-15T17:05:12.248Z, remoteAddress=/x.x.x.72:49238} on input <5cf13a5629fbc65472f9e843>.
2022-02-15T12:05:12.261-05:00 ERROR [DecodingProcessor] Error processing message RawMessage{id=6ffdeb87-8e81-11ec-a171-0050568bf8e7, journalOffset=5308282893, codec=gelf, payloadSize=326, timestamp=2022-02-15T17:05:12.248Z, remoteAddress=/x.x.x.72:49238}
java.lang.IllegalArgumentException: GELF message <6ffdeb87-8e81-11ec-a171-0050568bf8e7> (received from <x.x.x.72:49238>) has empty mandatory "short_message" field.
	at org.graylog2.inputs.codecs.GelfCodec.validateGELFMessage(GelfCodec.java:252) ~[graylog.jar:?]
	at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:134) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:150) ~[graylog.jar:?]
	at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:91) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:74) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]

Server restart:

2022-02-15T12:09:08.992-05:00 INFO  [Server] SIGNAL received. Shutting down.
2022-02-15T12:22:02.573-05:00 INFO  [CmdLineTool] Loaded plugin: AWS plugins 3.0.2 [org.graylog.aws.AWSPlugin]
2022-02-15T12:22:02.590-05:00 INFO  [CmdLineTool] Loaded plugin: Collector 3.0.2 [org.graylog.plugins.collector.CollectorPlugin]
2022-02-15T12:22:02.591-05:00 INFO  [CmdLineTool] Loaded plugin: Threat Intelligence Plugin 3.0.2 [org.graylog.plugins.threatintel.ThreatIntelPlugin]
2022-02-15T12:22:03.031-05:00 INFO  [CmdLineTool] Running with JVM arguments: -Xms2g -Xmx2g -XX:NewRatio=1 -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow -Djavax.net.ssl.truststore=/SSCLant/certs/cacerts.jks -Dlog4j2.formatMsgNoLookups=true -Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml -Djava.library.path=/usr/share/graylog-server/lib/sigar -Dgraylog2.installation_source=rpm
2022-02-15T12:22:03.278-05:00 INFO  [Version] HV000001: Hibernate Validator 5.1.3.Final
2022-02-15T12:22:07.713-05:00 INFO  [InputBufferImpl] Message journal is enabled.
2022-02-15T12:22:07.764-05:00 INFO  [NodeId] Node ID: graylog-server-node-ID-#
2022-02-15T12:22:08.167-05:00 INFO  [LogManager] Loading logs.
2022-02-15T12:22:08.256-05:00 WARN  [Log] Found a corrupted index file, /var/lib/graylog-server/journal/messagejournal-0/00000000005308061109.index, deleting and rebuilding index...
2022-02-15T12:22:10.711-05:00 INFO  [LogManager] Logs loading complete.
2022-02-15T12:22:10.715-05:00 INFO  [KafkaJournal] Initialized Kafka based journal at /var/lib/graylog-server/journal
2022-02-15T12:22:10.726-05:00 INFO  [InputBufferImpl] Initialized InputBufferImpl with ring size <65536> and wait strategy <BlockingWaitStrategy>, running 2 parallel message handlers.
2022-02-15T12:22:10.747-05:00 INFO  [cluster] Cluster created with settings {hosts=[localhost:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=5000}
2022-02-15T12:22:10.799-05:00 INFO  [cluster] Cluster description not yet available. Waiting for 30000 ms before timing out
2022-02-15T12:22:11.125-05:00 INFO  [connection] Opened connection [connectionId{localValue:1, serverValue:1}] to localhost:27017
2022-02-15T12:22:11.150-05:00 INFO  [cluster] Monitor thread successfully connected to server with description ServerDescription{address=localhost:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[4, 0, 9]}, minWireVersion=0, maxWireVersion=7, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=23672075}
2022-02-15T12:22:11.252-05:00 INFO  [connection] Opened connection [connectionId{localValue:2, serverValue:2}] to localhost:27017
2022-02-15T12:22:11.879-05:00 INFO  [AbstractJestClient] Setting server pool to a list of 1 servers: [http://127.0.0.1:9200]
2022-02-15T12:22:11.880-05:00 INFO  [JestClientFactory] Using multi thread/connection supporting pooling connection manager
2022-02-15T12:22:11.960-05:00 INFO  [JestClientFactory] Using custom ObjectMapper instance
2022-02-15T12:22:11.960-05:00 INFO  [JestClientFactory] Node Discovery disabled...
2022-02-15T12:22:11.960-05:00 INFO  [JestClientFactory] Idle connection reaping disabled...
2022-02-15T12:22:12.202-05:00 INFO  [ProcessBuffer] Initialized ProcessBuffer with ring size <65536> and wait strategy <BlockingWaitStrategy>.
2022-02-15T12:22:14.752-05:00 WARN  [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2022-02-15T12:22:14.764-05:00 INFO  [OutputBuffer] Initialized OutputBuffer with ring size <65536> and wait strategy <BlockingWaitStrategy>.
2022-02-15T12:22:16.240-05:00 WARN  [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2022-02-15T12:22:17.756-05:00 WARN  [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2022-02-15T12:22:17.901-05:00 INFO  [connection] Opened connection [connectionId{localValue:3, serverValue:3}] to localhost:27017
2022-02-15T12:22:19.271-05:00 WARN  [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2022-02-15T12:22:20.622-05:00 WARN  [GeoIpResolverEngine] GeoIP database file does not exist: /etc/graylog/server/GeoLite2-City.mmdb
2022-02-15T12:22:21.229-05:00 INFO  [ServerBootstrap] Graylog server 3.0.2+1686930 starting up
2022-02-15T12:22:21.229-05:00 INFO  [ServerBootstrap] JRE: Red Hat, Inc. 1.8.0_282 on Linux 3.10.0-1160.15.2.el7.x86_64
2022-02-15T12:22:21.229-05:00 INFO  [ServerBootstrap] Deployment: rpm
2022-02-15T12:22:21.230-05:00 INFO  [ServerBootstrap] OS: CentOS Linux 7 (Core) (centos)
2022-02-15T12:22:21.230-05:00 INFO  [ServerBootstrap] Arch: amd64
2022-02-15T12:22:21.303-05:00 INFO  [PeriodicalsService] Starting 27 periodicals ...
2022-02-15T12:22:21.304-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.ThroughputCalculator] periodical in [0s], polling every [1s].
2022-02-15T12:22:21.320-05:00 INFO  [Periodicals] Starting [org.graylog.plugins.pipelineprocessor.periodical.LegacyDefaultStreamMigration] periodical, running forever.
2022-02-15T12:22:21.320-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.AlertScannerThread] periodical in [10s], polling every [60s].
2022-02-15T12:22:21.321-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.BatchedElasticSearchOutputFlushThread] periodical in [0s], polling every [1s].
2022-02-15T12:22:21.321-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.ClusterHealthCheckThread] periodical in [120s], polling every [20s].
2022-02-15T12:22:21.326-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.GarbageCollectionWarningThread] periodical, running forever.
2022-02-15T12:22:21.326-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.IndexerClusterCheckerThread] periodical in [0s], polling every [30s].
2022-02-15T12:22:21.328-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.IndexRetentionThread] periodical in [0s], polling every [300s].
2022-02-15T12:22:21.338-05:00 INFO  [LegacyDefaultStreamMigration] Legacy default stream has no connections, no migration needed.
2022-02-15T12:22:21.342-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.IndexRotationThread] periodical in [0s], polling every [10s].
2022-02-15T12:22:21.343-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.NodePingThread] periodical in [0s], polling every [1s].
2022-02-15T12:22:21.344-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.VersionCheckThread] periodical in [300s], polling every [1800s].
2022-02-15T12:22:21.345-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.ThrottleStateUpdaterThread] periodical in [1s], polling every [1s].
2022-02-15T12:22:21.345-05:00 INFO  [Periodicals] Starting [org.graylog2.events.ClusterEventPeriodical] periodical in [0s], polling every [1s].
2022-02-15T12:22:21.347-05:00 INFO  [Periodicals] Starting [org.graylog2.events.ClusterEventCleanupPeriodical] periodical in [0s], polling every [86400s].
2022-02-15T12:22:21.348-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.ClusterIdGeneratorPeriodical] periodical, running forever.
2022-02-15T12:22:21.348-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.IndexRangesMigrationPeriodical] periodical, running forever.
2022-02-15T12:22:21.349-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.IndexRangesCleanupPeriodical] periodical in [15s], polling every [3600s].
2022-02-15T12:22:21.461-05:00 INFO  [connection] Opened connection [connectionId{localValue:8, serverValue:8}] to localhost:27017
2022-02-15T12:22:21.463-05:00 INFO  [connection] Opened connection [connectionId{localValue:4, serverValue:4}] to localhost:27017
2022-02-15T12:22:21.470-05:00 INFO  [connection] Opened connection [connectionId{localValue:10, serverValue:10}] to localhost:27017
2022-02-15T12:22:21.471-05:00 INFO  [connection] Opened connection [connectionId{localValue:5, serverValue:5}] to localhost:27017
2022-02-15T12:22:21.485-05:00 INFO  [connection] Opened connection [connectionId{localValue:6, serverValue:6}] to localhost:27017
2022-02-15T12:22:21.488-05:00 INFO  [PeriodicalsService] Not starting [org.graylog2.periodical.UserPermissionMigrationPeriodical] periodical. Not configured to run on this node.
2022-02-15T12:22:21.488-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.AlarmCallbacksMigrationPeriodical] periodical, running forever.
2022-02-15T12:22:21.488-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.ConfigurationManagementPeriodical] periodical, running forever.
2022-02-15T12:22:21.491-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.LdapGroupMappingMigration] periodical, running forever.
2022-02-15T12:22:21.492-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.IndexFailuresPeriodical] periodical, running forever.
2022-02-15T12:22:21.492-05:00 INFO  [Periodicals] Starting [org.graylog2.periodical.TrafficCounterCalculator] periodical in [0s], polling every [1s].
2022-02-15T12:22:21.493-05:00 INFO  [Periodicals] Starting [org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical] periodical in [0s], polling every [3600s].
2022-02-15T12:22:21.493-05:00 INFO  [Periodicals] Starting [org.graylog.plugins.sidecar.periodical.PurgeExpiredSidecarsThread] periodical in [0s], polling every [600s].
2022-02-15T12:22:21.495-05:00 INFO  [Periodicals] Starting [org.graylog.plugins.sidecar.periodical.PurgeExpiredConfigurationUploads] periodical in [0s], polling every [600s].
2022-02-15T12:22:21.499-05:00 INFO  [Periodicals] Starting [org.graylog.plugins.collector.periodical.PurgeExpiredCollectorsThread] periodical in [0s], polling every [3600s].
2022-02-15T12:22:21.505-05:00 INFO  [connection] Opened connection [connectionId{localValue:7, serverValue:7}] to localhost:27017
2022-02-15T12:22:21.508-05:00 INFO  [IndexRetentionThread] Elasticsearch cluster not available, skipping index retention checks.
2022-02-15T12:22:21.512-05:00 ERROR [Cluster] Couldn't read cluster health for indices [graylog_*] (Could not connect to http://127.0.0.1:9200)
2022-02-15T12:22:21.512-05:00 INFO  [IndexerClusterCheckerThread] Indexer not fully initialized yet. Skipping periodic cluster check.
2022-02-15T12:22:21.513-05:00 INFO  [IndexFieldTypePollerPeriodical] Cluster not connected yet, delaying index field type initialization until it is reachable.
2022-02-15T12:22:21.513-05:00 INFO  [connection] Opened connection [connectionId{localValue:9, serverValue:9}] to localhost:27017
2022-02-15T12:22:21.626-05:00 INFO  [V20161130141500_DefaultStreamRecalcIndexRanges] Cluster not connected yet, delaying migration until it is reachable.
2022-02-15T12:22:21.833-05:00 INFO  [JerseyService] Enabling CORS for HTTP endpoint
2022-02-15T12:22:36.364-05:00 INFO  [IndexRangesCleanupPeriodical] Skipping index range cleanup because the Elasticsearch cluster is unreachable or unhealthy
2022-02-15T12:22:42.780-05:00 INFO  [NetworkListener] Started listener bound to [hostname.domainname:9000]
2022-02-15T12:22:42.782-05:00 INFO  [HttpServer] [HttpServer] Started.
2022-02-15T12:22:42.782-05:00 INFO  [JerseyService] Started REST API at <hostname.domainname:9000>
2022-02-15T12:22:42.784-05:00 INFO  [ServiceManagerListener] Services are healthy
2022-02-15T12:22:42.784-05:00 INFO  [InputSetupService] Triggering launching persisted inputs, node transitioned from Uninitialized [LB:DEAD] to Running [LB:ALIVE]
2022-02-15T12:22:42.785-05:00 INFO  [ServerBootstrap] Services started, startup times in ms: {InputSetupService [RUNNING]=39, ConfigurationEtagService [RUNNING]=48, EtagService [RUNNING]=48, GracefulShutdownService [RUNNING]=48, JournalReader [RUNNING]=49, OutputSetupService [RUNNING]=51, BufferSynchronizerService [RUNNING]=69, KafkaJournal [RUNNING]=70, LookupTableService [RUNNING]=252, PeriodicalsService [RUNNING]=256, StreamCacheService [RUNNING]=280, JerseyService [RUNNING]=21504}
2022-02-15T12:22:42.869-05:00 INFO  [InputStateListener] Input [Raw/Plaintext UDP/5cf1398829fbc65472f9e760] is now STARTING
2022-02-15T12:22:42.870-05:00 INFO  [InputStateListener] Input [Raw/Plaintext UDP/5cf13c1329fbc65472f9ea2a] is now STARTING
2022-02-15T12:22:42.872-05:00 INFO  [ServerBootstrap] Graylog server up and running.
2022-02-15T12:22:42.875-05:00 INFO  [InputStateListener] Input [Syslog TCP/5cf13c2e29fbc65472f9ea4a] is now STARTING
2022-02-15T12:22:42.885-05:00 INFO  [InputStateListener] Input [Syslog UDP/5cf13c4629fbc65472f9ea68] is now STARTING
2022-02-15T12:22:42.887-05:00 INFO  [InputStateListener] Input [GELF TCP/5cf13a5629fbc65472f9e843] is now STARTING
2022-02-15T12:22:43.293-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=APC Logs, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0xa02083af, L:/0.0.0.0:12201]) should be 262144 but is 425984.
2022-02-15T12:22:43.295-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input RawUDPInput{title=Firewall, type=org.graylog2.inputs.raw.udp.RawUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0xf63438ae, L:/0.0.0.0:1514]) should be 262144 but is 425984.
2022-02-15T12:22:43.312-05:00 WARN  [AbstractTcpTransport] receiveBufferSize (SO_RCVBUF) for input GELFTCPInput{title=Windows Event Logs, type=org.graylog2.inputs.gelf.tcp.GELFTCPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0xdee05088, L:/0.0.0.0:12201]) should be 1048576 but is 425984.
2022-02-15T12:22:43.322-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input RawUDPInput{title=Firewall, type=org.graylog2.inputs.raw.udp.RawUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0x1a74ac44, L:/0.0.0.0:1514]) should be 262144 but is 425984.
2022-02-15T12:22:43.325-05:00 INFO  [InputStateListener] Input [Syslog TCP/5cf13c2e29fbc65472f9ea4a] is now RUNNING
2022-02-15T12:22:43.325-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input RawUDPInput{title=FirePOWER, type=org.graylog2.inputs.raw.udp.RawUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0xfd18c75c, L:/0.0.0.0:5140]) should be 262144 but is 425984.
2022-02-15T12:22:43.363-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input RawUDPInput{title=Firewall, type=org.graylog2.inputs.raw.udp.RawUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0x86e50f83, L:/0.0.0.0:1514]) should be 262144 but is 425984.
2022-02-15T12:22:43.403-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=APC Logs, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0xee949a39, L:/0.0.0.0:12201]) should be 262144 but is 425984.
2022-02-15T12:22:43.474-05:00 WARN  [AbstractTcpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogTCPInput{title=COC Infrastructure, type=org.graylog2.inputs.syslog.tcp.SyslogTCPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0x7771e50c, L:/0.0.0.0:1514]) should be 1048576 but is 425984.
2022-02-15T12:22:43.475-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input RawUDPInput{title=Firewall, type=org.graylog2.inputs.raw.udp.RawUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0x9fd20f1b, L:/0.0.0.0:1514]) should be 262144 but is 425984.
2022-02-15T12:22:43.475-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input RawUDPInput{title=FirePOWER, type=org.graylog2.inputs.raw.udp.RawUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0x7166aa02, L:/0.0.0.0:5140]) should be 262144 but is 425984.
2022-02-15T12:22:43.980-05:00 INFO  [InputStateListener] Input [Raw/Plaintext UDP/5cf1398829fbc65472f9e760] is now RUNNING
2022-02-15T12:22:44.056-05:00 INFO  [InputStateListener] Input [GELF TCP/5cf13a5629fbc65472f9e843] is now RUNNING
2022-02-15T12:22:44.069-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=APC Logs, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0x81ddea8a, L:/0.0.0.0:12201]) should be 262144 but is 425984.
2022-02-15T12:22:44.070-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input RawUDPInput{title=FirePOWER, type=org.graylog2.inputs.raw.udp.RawUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0x5db19602, L:/0.0.0.0:5140]) should be 262144 but is 425984.
2022-02-15T12:22:44.081-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input RawUDPInput{title=FirePOWER, type=org.graylog2.inputs.raw.udp.RawUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0x2ff3b956, L:/0.0.0.0:5140]) should be 262144 but is 425984.
2022-02-15T12:22:44.084-05:00 INFO  [InputStateListener] Input [Raw/Plaintext UDP/5cf13c1329fbc65472f9ea2a] is now RUNNING
2022-02-15T12:22:44.092-05:00 WARN  [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=APC Logs, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=graylog-server-node-ID-#} (channel [id: 0xe7f27dc4, L:/0.0.0.0:12201]) should be 262144 but is 425984.
2022-02-15T12:22:44.095-05:00 INFO  [InputStateListener] Input [Syslog UDP/5cf13c4629fbc65472f9ea68] is now RUNNING

After server restart new error:

2022-02-15T12:43:33.681-05:00 WARN  [ProxiedResource] Unable to call https://hostname.domainname:9000/api/system/metrics/multiple on node <graylog-server-node-ID-#>
javax.net.ssl.SSLHandshakeException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: 
	at sun.security.ssl.Alert.createSSLException(Alert.java:131) ~[?:1.8.0_282]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:324) ~[?:1.8.0_282]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:267) ~[?:1.8.0_282]
	at sun.security.ssl.TransportContext.fatal(TransportContext.java:262) ~[?:1.8.0_282]
	at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:654) ~[?:1.8.0_282]
	at sun.security.ssl.CertificateMessage$T12CertificateConsumer.onCertificate(CertificateMessage.java:473) ~[?:1.8.0_282]
	at sun.security.ssl.CertificateMessage$T12CertificateConsumer.consume(CertificateMessage.java:369) ~[?:1.8.0_282]
	at sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:377) ~[?:1.8.0_282]
	at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444) ~[?:1.8.0_282]
	at sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:422) ~[?:1.8.0_282]
	at sun.security.ssl.TransportContext.dispatch(TransportContext.java:182) ~[?:1.8.0_282]
	at sun.security.ssl.SSLTransport.decode(SSLTransport.java:149) ~[?:1.8.0_282]
	at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1143) ~[?:1.8.0_282]
	at sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1054) ~[?:1.8.0_282]
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:394) ~[?:1.8.0_282]
	at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:318) ~[graylog.jar:?]
	at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:282) ~[graylog.jar:?]
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:167) ~[graylog.jar:?]
	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:257) ~[graylog.jar:?]
	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:135) ~[graylog.jar:?]
	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:114) ~[graylog.jar:?]
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42) ~[graylog.jar:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126) ~[graylog.jar:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
	at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:61) ~[graylog.jar:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) ~[graylog.jar:?]
	at okhttp3.RealCall.execute(RealCall.java:77) ~[graylog.jar:?]
	at retrofit2.OkHttpCall.execute(OkHttpCall.java:180) ~[graylog.jar:?]
	at org.graylog2.shared.rest.resources.ProxiedResource.lambda$getForAllNodes$0(ProxiedResource.java:78) ~[graylog.jar:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:456) ~[?:1.8.0_282]
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:323) ~[?:1.8.0_282]
	at sun.security.validator.Validator.validate(Validator.java:271) ~[?:1.8.0_282]
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:315) ~[?:1.8.0_282]
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:223) ~[?:1.8.0_282]
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:129) ~[?:1.8.0_282]
	at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:638) ~[?:1.8.0_282]
	... 38 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141) ~[?:1.8.0_282]
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126) ~[?:1.8.0_282]
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280) ~[?:1.8.0_282]
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:451) ~[?:1.8.0_282]
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:323) ~[?:1.8.0_282]
	at sun.security.validator.Validator.validate(Validator.java:271) ~[?:1.8.0_282]
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:315) ~[?:1.8.0_282]
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:223) ~[?:1.8.0_282]
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:129) ~[?:1.8.0_282]
	at sun.security.ssl.CertificateMessage$T12CertificateConsumer.checkServerCerts(CertificateMessage.java:638) ~[?:1.8.0_282]
	... 38 more

I am happy to provide whatever other information is needed to get pointed in the right direction. Thank you in advance for your help.

You may want to turn off all certs to see whether things run without TLS/SSL, since that was an issue, and whether that makes a difference. I also noticed that you are running MongoDB 4 with Graylog 3.0. You may need to make sure that Mongo is running in compatibility mode if it is not already (see here). If you look at the Graylog 3.0 docs (the version you are on), they say the maximum supported version of Mongo is 3.6. I don’t know for sure whether that is causing your issue, but we need to eliminate what we can…
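
To see which compatibility level MongoDB is currently running at, something like the following could be used. This is a sketch under assumptions: that "compatibility mode" here refers to MongoDB's `featureCompatibilityVersion` setting, that the `mongo` shell from the installed `mongodb-org-shell` package is on the PATH, and that the local instance accepts unauthenticated connections. `check_fcv` is a made-up helper name.

```shell
#!/bin/sh
# Ask the local MongoDB instance for its featureCompatibilityVersion.
# check_fcv is a hypothetical helper name; it only reads the setting and changes nothing.
check_fcv() {
  mongo --quiet --eval \
    'printjson(db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 }))'
}

# Usage on the Graylog/MongoDB host:
#   check_fcv
```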

Hello @fffhurst

I agree with @tmacgbay.

After looking over your logs, the main concern is these errors, which are a direct result of the certificates.

Unable to call https://hostname.domainname:9000/api/system/metrics/multiple

PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target

For the error below, I would check your ES and make sure your firewall/SELinux is not blocking anything (i.e. ports 9200, 9000, 27017, 9300). If you rebooted your server, this may be a direct result of that. If only the Graylog service was restarted, then I would definitely look into ES.

[Cluster] Couldn't read cluster health for indices [graylog_*] (Could not connect to http://127.0.0.1:9200)
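
A quick way to test those ports is to try opening a TCP connection to each one. This is a minimal sketch, assuming bash and the coreutils timeout command are available (both ship with CentOS 7). port_open is a hypothetical helper name, and 127.0.0.1 is used because the error above refers to the local Elasticsearch address.

```shell
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 s.
# Uses bash's /dev/tcp redirection, so it needs bash and coreutils' timeout.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Ports named in the thread: Graylog web/API, ES HTTP, ES transport, MongoDB.
for port in 9000 9200 9300 27017; do
  if port_open 127.0.0.1 "$port"; then
    echo "port $port: open"
  else
    echo "port $port: closed or blocked"
  fi
done
```

Run on the Graylog host, this only shows whether each service is listening. Run from another Linux host with the server's address in place of 127.0.0.1, it also exercises the firewall. On CentOS 7, getenforce and firewall-cmd --list-all show the SELinux mode and the firewalld rules.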

A couple of tips on checking Elasticsearch.

What do you see when executing this?

curl -XGET http://127.0.0.1:9200/_cluster/health?pretty=true

If there is a problem, the command below explains what is wrong and why, which gives you better clarity on the status of Elasticsearch.

curl  -XGET http://127.0.0.1:9200/_cluster/allocation/explain?pretty

So the moral of the story here is: if you can, comment out the lines for TLS (HTTPS) and restart the GL service.

You should have just this one line

Then log in via http://192.168.1.13:9000

If you do make these configuration changes, make sure you tail your Graylog log file again.
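
To see which HTTP/TLS lines are currently active (not commented out) before and after such a change, something like the following can help. It assumes the package default path /etc/graylog/server/server.conf and the Graylog 3.0 setting names (http_bind_address, http_enable_tls, and so on). active_http_settings is a hypothetical helper name. http_tls_key_password is deliberately left out so the output is safe to paste into a thread.

```shell
# Hypothetical helper: list the uncommented HTTP/TLS settings in a Graylog
# server.conf, with line numbers. Commented-out lines (starting with #) are skipped.
active_http_settings() {
  grep -nE '^[[:space:]]*http_(bind_address|publish_uri|external_uri|enable_tls|tls_cert_file|tls_key_file)[[:space:]]*=' "$1"
}

# Usage (path is the assumed package default):
#   active_http_settings /etc/graylog/server/server.conf
```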

Forget this for now, let's fix ES first

Well, by this it seems that even graylog’s own web interface cert may be bad.

You can either disable security (easier) or try to check the cert with

curl -XGET https://hostname.domainname:9000/api
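
curl only reports that verification failed. openssl can show which certificate the server actually presents. This is a sketch, assuming the openssl command-line tool is installed and using the placeholder hostname from the thread. fetch_cert and describe_cert are hypothetical helper names.

```shell
# Hypothetical helpers for inspecting the certificate served on the Graylog port.

# Print the TLS handshake output, including the server's leaf certificate in PEM.
fetch_cert() {
  openssl s_client -connect "$1:$2" -servername "$1" </dev/null 2>/dev/null
}

# Read a PEM certificate on stdin and print its subject, issuer, and validity dates.
describe_cert() {
  openssl x509 -noout -subject -issuer -dates
}

# Usage (hostname is the placeholder from the thread):
#   fetch_cert hostname.domainname 9000 | describe_cert
```

An identical subject and issuer (a self-signed certificate) or a notAfter date in the past would both be consistent with the PKIX and "certificate not valid" errors reported here.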

Hello @nisow95612

This is a true statement.

If he shows the data from the commands above, this could be solved. It could even be that he just needs to rotate or recalculate his index via the web UI.

Ok. Please be a little more clear. What are we forgetting, and which of the instructions above is involved in “fix ES”? Does fixing ES involve commenting out lines to disable TLS, or something else?

The first line is misleading.

All of your errors have to do with misconfiguration of certificates in your installation.

You can start with verifying your certificate setup in Graylog (and make sure the certs are valid), or you can remove the entire certificate/SSL/TLS settings from your configuration to verify that Graylog works, then step through adding them back. A bunch of diagnostic commands have been offered; you can try running them and posting the results if they don’t lead you to a solution…


My apologies @fffhurst,
I was trying to answer the other member's question. That was not directed toward you.

Here are the results of the two curl commands given above:

{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 80,
  "active_shards" : 80,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}
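
Read together, these two responses point to a healthy Elasticsearch: the status is green with 0 unassigned shards, and the 400 from the allocation explain call only says there are no unassigned shards to explain. A small sketch of checking this automatically follows. It assumes the pretty-printed output format shown above (one "key" : value pair per line); health_field and check_health are hypothetical helper names.

```shell
# Hypothetical helper: print the value of one top-level field from
# pretty-printed _cluster/health JSON on stdin.
health_field() {
  awk -v key="\"$1\"" '$1 == key { v = $3; gsub(/[",]/, "", v); print v; exit }'
}

# Hypothetical helper: report healthy only for status green with 0 unassigned shards.
check_health() {
  json=$(cat)
  status=$(printf '%s\n' "$json" | health_field status)
  unassigned=$(printf '%s\n' "$json" | health_field unassigned_shards)
  if [ "$status" = "green" ] && [ "$unassigned" = "0" ]; then
    echo "ES looks healthy: status=$status unassigned_shards=$unassigned"
  else
    echo "ES needs attention: status=$status unassigned_shards=$unassigned"
    return 1
  fi
}

# Usage against the local node from the thread:
#   curl -s 'http://127.0.0.1:9200/_cluster/health?pretty=true' | check_health
```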

So, I cannot disable TLS in the production environment, but I disabled TLS in an identical test environment. I cannot transfer anything out of test, so it has to be a screenshot:

There are no errors in your screenshot. Have you confirmed your certificate settings?

Possibly review all certificate settings in production as if you were putting them in for the first time…
