Graylog stopps processing messages. Only coming in but no out


(Peter Schoenegger) #1

We have installed Graylog using the latest OVA virtual appliance image (https://packages.graylog2.org/appliances/ova). We then upgraded the system to the latest Graylog release v2.4.6. For like 2 months everything worked perfeclty fine when sending in log data. The server has 32GB of RAM and 8 cores available.

When logging in today morning i recognized that messages are coming in (about 2500 msg/s) but there are no out messages. I then tried checking the logs in /var/log/graylog for finding a root cause, restarting the server and even doing a graylog-ctl reconfigure but there are still no messages being processed by the system. Can you please help me what the problem could be here?!

Please also find the “graylog-settings.json” and the current file below.

{
  "timezone": "Europe/Vienna",
  "smtp_server": "mail.XXX.XX",
  "smtp_port": 2500,
  "smtp_user": "",
  "smtp_password": "",
  "smtp_from_email": "graylog@XXX.XX",
  "smtp_web_url": "http://graylog-beta",
  "smtp_no_tls": true,
  "smtp_no_ssl": true,
  "master_node": "127.0.0.1",
  "local_connect": false,
  "current_address": "172.20.45.29",
  "last_address": "172.20.45.29",
  "enforce_ssl": false,
  "journal_size": 500,
  "node_id": false,
  "internal_logging": false,
  "web_listen_uri": false,
  "web_endpoint_uri": false,
  "rest_listen_uri": false,
  "rest_transport_uri": false,
  "external_rest_uri": false,
  "custom_attributes": {
    "graylog-server": {
      "memory": "10240m"
    },
    "elasticsearch": {
      "memory": "10240m"
    }
  }
}

When doing a restart of the server message processing works (about 200 to 2500 msg/s) for like a couple of minutes. For not losing messages I increased the journal size to 500GB…

Many thanks in advance for any help which is highly appreciated,

Peter

graylog.conf:
# If you are running more than one instances of graylog-server you have to select one of these
# instances as master. The master will perform some periodical tasks that non-masters won't perform.
is_master = true

# The auto-generated node ID will be stored in this file and read after restarts. It is a good idea
# to use an absolute file path here if you are starting graylog-server from init scripts or similar.
node_id_file = /var/opt/graylog/graylog-server-node-id

# You MUST set a secret to secure/pepper the stored user passwords here. Use at least 64 characters.
# Generate one by using for example: pwgen -N 1 -s 96
password_secret = eb2de5a7e5d1c93938eb5f47b5356c0e9f85a6252e448f3e567e8efc2499496208eda6d92c5cfcf2f69f134e842658ace00fd24d1916f9315ee69053ff1cb812

# the default root user is named 'admin'
root_username = admin

# You MUST specify a hash password for the root user (which you only need to initially set up the
# system and in case you lose connectivity to your authentication backend)
# This password cannot be changed using the API or via the web interface. If you need to change it,
# modify it in this file.
# Create one by using for example: echo -n yourpassword | shasum -a 256
# and put the resulting hash value into the following line
root_password_sha2 = 4f5d5aa0bc267de17b7194fecd54a85fe0d61ad46463fea621622f42f1e9e9ad

# The email address of the root user.
# Default is empty
#root_email = ""

# The time zone setting of the root user.
# Default is UTC
root_timezone = Europe/Vienna

# Set plugin directory here (relative or absolute)
plugin_dir = /opt/graylog/plugin

# REST API listen URI. Must be reachable by other graylog-server nodes if you run a cluster.
rest_listen_uri = http://0.0.0.0:9000/api

# Web interface listen URI
web_listen_uri = http://0.0.0.0:9000/

# REST API transport address. Defaults to the value of rest_listen_uri. Exception: If rest_listen_uri
# is set to a wildcard IP address (0.0.0.0) the first non-loopback IPv4 system address is used.
# If set, his will be promoted in the cluster discovery APIs, so other nodes may try to connect on
# this address and it is used to generate URLs addressing entities in the REST API. (see rest_listen_uri)
# You will need to define this, if your Graylog server is running behind a HTTP proxy that is rewriting
# the scheme, host name or URI.

# Web interface endpoint URI. This setting can be overriden on a per-request basis with the X-Graylog-Server-URL header.
# Default: $rest_transport_uri

# Enable CORS headers for REST api. This is necessary for JS-clients accessing the server directly.
# If these are disabled, modern browsers will not be able to retrieve resources from the server.
# This is disabled by default. Uncomment the next line to enable it.
rest_enable_cors = true

# Enable GZIP support for REST api. This compresses API responses and therefore helps to reduce
# overall round trip times. This is disabled by default. Uncomment the next line to enable it.
#rest_enable_gzip = true

# Enable HTTPS support for the REST API. This secures the communication with the REST API with
# TLS to prevent request forgery and eavesdropping. This is disabled by default. Uncomment the
# next line to enable it.
#rest_enable_tls = true

# The X.509 certificate file to use for securing the REST API.
#rest_tls_cert_file = /path/to/graylog2.crt

# The private key to use for securing the REST API.
#rest_tls_key_file = /path/to/graylog2.key

# The password to unlock the private key used for securing the REST API.
#rest_tls_key_password = secret

# The maximum size of a single HTTP chunk in bytes.
#rest_max_chunk_size = 8192

# The maximum size of the HTTP request headers in bytes.
#rest_max_header_size = 8192

# The maximal length of the initial HTTP/1.1 line in bytes.
#rest_max_initial_line_length = 4096

# The size of the thread pool used exclusively for serving the REST API.
#rest_thread_pool_size = 16

# The size of the worker thread pool used exclusively for serving the REST API.
#rest_worker_threads_max_pool_size = 16

# Disable checking the version of Elasticsearch for being compatible with this Graylog release.
# WARNING: Using Graylog with unsupported and untested versions of Elasticsearch may lead to data loss!
#elasticsearch_disable_version_check = true

# Disable message retention on this node, i. e. disable Elasticsearch index rotation.
#no_retention = false

# How many Elasticsearch shards and replicas should be used per index? Note that this only applies to newly created indices.
elasticsearch_shards = 4
elasticsearch_replicas = 1

# Prefix for all Elasticsearch indices and index aliases managed by Graylog.
elasticsearch_index_prefix = graylog

# Do you want to allow searches with leading wildcards? This can be extremely resource hungry and should only
# be enabled with care. See also: http://graylog2.org/resources/documentation/general/queries
allow_leading_wildcard_searches = true

# Do you want to allow searches to be highlighted? Depending on the size of your messages this can be memory hungry and
# should only be enabled after making sure your Elasticsearch cluster has enough memory.
allow_highlighting = false

# List of Elasticsearch hosts Graylog should connect to.
# Need to be specified as a comma-separated list of valid URIs for the http ports of your elasticsearch nodes.
# If one or more of your elasticsearch hosts require authentication, include the credentials in each node URI that
# requires authentication.
elasticsearch_hosts = http://172.20.45.29:9200

# Maximum amount of time to wait for successfull connection to Elasticsearch HTTP port.
#elasticsearch_connect_timeout = 10s

# Maximum amount of time to wait for reading back a response from an Elasticsearch server.
#elasticsearch_socket_timeout = 60s

# Maximum idle time for an Elasticsearch connection. If this is exceeded, this connection will
# be tore down.
#elasticsearch_idle_timeout = -1s

# Maximum number of total connections to Elasticsearch.
elasticsearch_max_total_connections = 20

# Maximum number of total connections per Elasticsearch route (normally this means per
# elasticsearch server).
elasticsearch_max_total_connections_per_route = 2

# Enable automatic Elasticsearch node discovery through Nodes Info,
# see https://www.elastic.co/guide/en/elasticsearch/reference/5.4/cluster-nodes-info.html
#
# WARNING: Automatic node discovery does not work if Elasticsearch requires authentication, e. g. with Shield.
#
elasticsearch_discovery_enabled = true

# Filter for including/excluding Elasticsearch nodes in discovery according to their custom attributes,
# see https://www.elastic.co/guide/en/elasticsearch/reference/5.4/cluster.html#cluster-nodes
#
# Default: empty
#elasticsearch_discovery_filter = rack:42

# Frequency of the Elasticsearch node discovery.
#elasticsearch_discovery_frequency = 30s

# Analyzer (tokenizer) to use for message and full_message field. The "standard" filter usually is a good idea.
# All supported analyzers are: standard, simple, whitespace, stop, keyword, pattern, language, snowball, custom
# Elasticsearch documentation: http://www.elasticsearch.org/guide/reference/index-modules/analysis/
# Note that this setting only takes effect on newly created indices.
elasticsearch_analyzer = standard

# Global request timeout for Elasticsearch requests (e. g. during search, index creation, or index time-range
# calculations) based on a best-effort to restrict the runtime of Elasticsearch operations.
# Default: 1m
#elasticsearch_request_timeout = 1m

# Batch size for the Elasticsearch output. This is the maximum (!) number of messages the Elasticsearch output
# module will get at once and write to Elasticsearch in a batch call. If the configured batch size has not been
# reached within output_flush_interval seconds, everything that is available will be flushed at once. Remember
# that every outputbuffer processor manages its own batch and performs its own batch write calls.
# ("outputbuffer_processors" variable)
output_batch_size = 500

# Flush interval (in seconds) for the Elasticsearch output. This is the maximum amount of time between two
# batches of messages written to Elasticsearch. It is only effective at all if your minimum number of messages
# for this time period is less than output_batch_size * outputbuffer_processors.
output_flush_interval = 1

# As stream outputs are loaded only on demand, an output which is failing to initialize will be tried over and
# over again. To prevent this, the following configuration options define after how many faults an output will
# not be tried again for an also configurable amount of seconds.
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

# The number of parallel running processors.
# Raise this number if your buffers are filling up.
processbuffer_processors = 5
outputbuffer_processors = 3

#outputbuffer_processor_keep_alive_time = 5000
#outputbuffer_processor_threads_core_pool_size = 3
#outputbuffer_processor_threads_max_pool_size = 30

# UDP receive buffer size for all message inputs (e. g. SyslogUDPInput).
#udp_recvbuffer_sizes = 1048576

# Wait strategy describing how buffer processors wait on a cursor sequence. (default: sleeping)
# Possible types:
#  - yielding
#     Compromise between performance and CPU usage.
#  - sleeping
#     Compromise between performance and CPU usage. Latency spikes can occur after quiet periods.
#  - blocking
#     High throughput, low latency, higher CPU usage.
#  - busy_spinning
#     Avoids syscalls which could introduce latency jitter. Best when threads can be bound to specific CPU cores.
processor_wait_strategy = blocking

# Size of internal ring buffers. Raise this if raising outputbuffer_processors does not help anymore.
# For optimum performance your LogMessage objects in the ring buffer should fit in your CPU L3 cache.
# Start server with --statistics flag to see buffer utilization.
# Must be a power of 2. (512, 1024, 2048, ...)
ring_size = 65536

inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

# Enable the disk based message journal.
message_journal_enabled = true

# The directory which will be used to store the message journal. The directory must me exclusively used by Graylog and
# must not contain any other files than the ones created by Graylog itself.
message_journal_dir = /var/opt/graylog/data/journal

# Journal hold messages before they could be written to Elasticsearch.
# For a maximum of 12 hours or 5 GB whichever happens first.
# During normal operation the journal will be smaller.
#message_journal_max_age = 12h
message_journal_max_size = 500gb

#message_journal_flush_age = 1m
#message_journal_flush_interval = 1000000
#message_journal_segment_age = 1h
#message_journal_segment_size = 100mb

# Number of threads used exclusively for dispatching internal events. Default is 2.
async_eventbus_processors = 2

# How many seconds to wait between marking node as DEAD for possible load balancers and starting the actual
# shutdown process. Set to 0 if you have no status checking load balancers in front.
lb_recognition_period_seconds = 3

# Every message is matched against the configured streams and it can happen that a stream contains rules which
# take an unusual amount of time to run, for example if its using regular expressions that perform excessive backtracking.
# This will impact the processing of the entire server. To keep such misbehaving stream rules from impacting other
# streams, Graylog limits the execution time for each stream.
# The default values are noted below, the timeout is in milliseconds.
# If the stream matching for one stream took longer than the timeout value, and this happened more than "max_faults" times
# that stream is disabled and a notification is shown in the web interface.
# stream_processing_timeout = 2000
# stream_processing_max_faults = 3

# Length of the interval in seconds in which the alert conditions for all streams should be checked
# and alarms are being sent.
alert_check_interval = 60

# Since 0.21 the graylog server supports pluggable output modules. This means a single message can be written to multiple
# outputs. The next setting defines the timeout for a single output module, including the default output module where all
# messages end up. This setting is specified in milliseconds.

# Time in milliseconds to wait for all message outputs to finish writing a single message.
#output_module_timeout = 10000

# Time in milliseconds after which a detected stale master node is being rechecked on startup.
#stale_master_timeout = 2000

# Time in milliseconds which Graylog is waiting for all threads to stop on shutdown.
# shutdown_timeout = 30000

# MongoDB Configuration
mongodb_uri = mongodb://127.0.0.1:27017/graylog

# Raise this according to the maximum connections your MongoDB server can handle if you encounter MongoDB connection problems.
mongodb_max_connections = 100

# Number of threads allowed to be blocked by MongoDB connections multiplier. Default: 5
# If mongodb_max_connections is 100, and mongodb_threads_allowed_to_block_multiplier is 5, then 500 threads can block. More than that and an exception will be thrown.
# http://api.mongodb.org/java/current/com/mongodb/MongoOptions.html#threadsAllowedToBlockForConnectionMultiplier
mongodb_threads_allowed_to_block_multiplier = 5

# Drools Rule File (Use to rewrite incoming log messages)
# See: http://graylog2.org/resources/documentation/general/rewriting

# Email transport
transport_email_enabled = true
transport_email_hostname = mail.XXX.XX
transport_email_port = 2500
transport_email_use_auth = false
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_auth_username = 
transport_email_auth_password = 
transport_email_subject_prefix = [graylog]
transport_email_from_email = graylog@XXX.XX

# Specify and uncomment this if you want to include links to the stream in your stream alert mails.
# This should define the fully qualified base url to your web interface exactly the same way as it is accessed by your users.
#
transport_email_web_interface_url = http://graylog-beta

# The default connect timeout for outgoing HTTP connections.
# Values must be a positive duration (and between 1 and 2147483647 when converted to milliseconds).
# Default: 5s
#http_connect_timeout = 5s

# The default read timeout for outgoing HTTP connections.
# Values must be a positive duration (and between 1 and 2147483647 when converted to milliseconds).
# Default: 10s
#http_read_timeout = 10s

# The default write timeout for outgoing HTTP connections.
# Values must be a positive duration (and between 1 and 2147483647 when converted to milliseconds).
# Default: 10s
#http_write_timeout = 10s

# HTTP proxy for outgoing HTTP calls

# Disable the optimization of Elasticsearch indices after index cycling. This may take some load from Elasticsearch
# on heavily used systems with large indices, but it will decrease search performance. The default is to optimize
# cycled indices.
#disable_index_optimization = true

# Optimize the index down to <= index_optimization_max_num_segments. A higher number may take some load from Elasticsearch
# on heavily used systems with large indices, but it will decrease search performance. The default is 1.
#index_optimization_max_num_segments = 1

# Disable the index range calculation on all open/available indices and only calculate the range for the latest
# index. This may speed up index cycling on systems with large indices but it might lead to wrong search results
# in regard to the time range of the messages (i. e. messages within a certain range may not be found). The default
# is to calculate the time range on all open/available indices.
#disable_index_range_calculation = true

# The threshold of the garbage collection runs. If GC runs take longer than this threshold, a system notification
# will be generated to warn the administrator about possible problems with the system. Default is 1 second.
#gc_warning_threshold = 1s

# Connection timeout for a configured LDAP server (e. g. ActiveDirectory) in milliseconds.
#ldap_connection_timeout = 2000

# Enable collection of Graylog-related metrics into MongoDB
#enable_metrics_collection = false

# Disable the use of SIGAR for collecting system stats
#disable_sigar = false

# The default cache time for dashboard widgets. (Default: 10 seconds, minimum: 1 second)
dashboard_widget_default_cache_time = 10s

# Automatically load content packs in "content_packs_dir" on the first start of Graylog.
content_packs_loader_enabled = true

# The directory which contains content packs which should be loaded on the first start of Graylog.
content_packs_dir = /opt/graylog/contentpacks

# A comma-separated list of content packs (files in "content_packs_dir") which should be applied on
# the first start of Graylog.
content_packs_auto_load = grok-patterns.json,content_pack_appliance.json

#2

check the elasticsearch service on your server.

elasticsearch_hosts = http://172.20.45.29:9200

usually this error comes when it can’t reach the elasticsearch.
check system/overview Elasticsearch cluster part. it should be green to work well.


(Jan Doberstein) #3

what is about the available space on the system?

I guess Elasticsearch runs into High-Watermark (will write that into the logfile) and then do not accept messages.

Add more disk space or delete data and everything will work again.


(Peter Schoenegger) #4

Hi Jan, Thanks for your message, really appreciate it! There is a whole bunch of free disk space on the system and it never ran into any full disk issues before.

df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             20G  4.0K   20G   1% /dev
tmpfs           4.0G  480K  4.0G   1% /run
/dev/dm-0        15G  4.9G  9.3G  35% /
none            4.0K     0  4.0K   0% /sys/fs/cgroup
none            5.0M     0  5.0M   0% /run/lock
none             20G     0   20G   0% /run/shm
none            100M     0  100M   0% /run/user
/dev/sda1       236M   81M  143M  37% /boot
**/dev/sdb1       9.9T  4.8T  4.7T  51% /var/opt/graylog/data**

(Peter Schoenegger) #5

Hi, Many thanks for your help on how to fix these problems, really appreciate it! Unfortunately the Elasticsearch cluster is red. What can be done for fixing that? Appreciate your help and time! Best regards, Peter


(Jan Doberstein) #6

one of the best ressources for this topic:

That should guide you to solve that.


(system) #7

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.