Hi
For the past few days, we have had a backlog of more than 10 million messages
on 3 Graylog nodes. We have tried everything but haven't found a solution; this is now a daily problem for us. The message "in" count is much higher than the "out" count. Please help me with this.
But I noticed one thing:
- Somehow the output message rate to Elasticsearch increases when we pause our running streams, and the backlog starts decreasing and returns to normal within an hour. We have gone through the rules (including the regexes) defined in the streams and all of them are working fine. Directly or indirectly, the streams may be responsible for the backlog on the GL nodes (a scripted way to pause them for testing is sketched below).
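In case it helps reproduce this, pausing every running stream can be scripted against the Graylog REST API. A minimal Python sketch; the node address and credentials are placeholders:

import requests

GRAYLOG_API = "http://gl-node01:9000/api"     # placeholder; any Graylog node works
AUTH = ("admin", "password")                  # placeholder credentials
HEADERS = {"X-Requested-By": "backlog-test"}  # the Graylog API requires this header on POSTs

# Fetch all streams and pause the ones that are currently running.
streams = requests.get(GRAYLOG_API + "/streams", auth=AUTH).json()["streams"]
for stream in streams:
    if not stream["disabled"]:
        r = requests.post("%s/streams/%s/pause" % (GRAYLOG_API, stream["id"]),
                          auth=AUTH, headers=HEADERS)
        r.raise_for_status()
        print("paused:", stream["title"])

(POST /streams/{id}/resume re-enables them afterwards.)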
In my environment, I am using 7 Graylog nodes, 3 Elasticsearch master nodes, and 4 Elasticsearch data nodes.
Graylog Version - 2.4
Elasticsearch Version - 5.6
Each of the 7 Graylog nodes has the same configuration.
Hardware Configuration (for every node)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
              total        used        free      shared  buff/cache   available
Mem:            29G         23G        210M        1.4G        5.8G        4.1G
Swap:            0B
server.conf (for every node)
output_batch_size = 5000
# Flush interval (in seconds) for the Elasticsearch output. This is the maximum amount of time between two
# batches of messages written to Elasticsearch. It is only effective at all if your minimum number of messages
# for this time period is less than output_batch_size * outputbuffer_processors.
output_flush_interval = 1
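# In our setup that product is output_batch_size * outputbuffer_processors
# = 5000 * 16 = 80,000, so the 1-second flush only matters when a node
# receives fewer than ~80,000 messages per interval.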
# As stream outputs are loaded only on demand, an output which is failing to initialize will be tried over and
# over again. To prevent this, the following configuration options define after how many faults an output will
# not be tried again for an also configurable amount of seconds.
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
# The number of parallel running processors.
# Raise this number if your buffers are filling up.
processbuffer_processors = 16
outputbuffer_processors = 16
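# For context: process (16) + output (16) + input (2, set further below)
# buffer processors add up to 34 processor threads on each 16-core node.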
# The following settings (outputbuffer_processor_*) configure the thread pools backing each output buffer processor.
# See https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html for technical details
# When the number of threads is greater than the core (see outputbuffer_processor_threads_core_pool_size),
# this is the maximum time in milliseconds that excess idle threads will wait for new tasks before terminating.
# Default: 5000
#outputbuffer_processor_keep_alive_time = 5000
# The number of threads to keep in the pool, even if they are idle, unless allowCoreThreadTimeOut is set
# Default: 3
#outputbuffer_processor_threads_core_pool_size = 3
# The maximum number of threads to allow in the pool
# Default: 30
#outputbuffer_processor_threads_max_pool_size = 30
# UDP receive buffer size for all message inputs (e. g. SyslogUDPInput).
#udp_recvbuffer_sizes = 1048576
# Wait strategy describing how buffer processors wait on a cursor sequence. (default: sleeping)
# Possible types:
# - yielding
# Compromise between performance and CPU usage.
# - sleeping
# Compromise between performance and CPU usage. Latency spikes can occur after quiet periods.
# - blocking
# High throughput, low latency, higher CPU usage.
# - busy_spinning
# Avoids syscalls which could introduce latency jitter. Best when threads can be bound to specific CPU cores.
processor_wait_strategy = blocking
# Size of internal ring buffers. Raise this if raising outputbuffer_processors does not help anymore.
# For optimum performance your LogMessage objects in the ring buffer should fit in your CPU L3 cache.
# Must be a power of 2. (512, 1024, 2048, ...)
ring_size = 262144
inputbuffer_ring_size = 262144
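# 262144 = 2^18, so both ring sizes satisfy the power-of-two requirement above.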
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
# Enable the disk based message journal.
message_journal_enabled = true
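Since the journal is enabled, the on-disk backlog per node can be watched through the API as well. Another small Python sketch (same placeholder host and credentials; field names as I understand the 2.x /system/journal response, so adjust if yours differ):

import requests

GRAYLOG_API = "http://gl-node01:9000/api"  # placeholder; query each node separately
AUTH = ("admin", "password")               # placeholder credentials

info = requests.get(GRAYLOG_API + "/system/journal", auth=AUTH).json()
print("uncommitted entries:", info["uncommitted_journal_entries"])
print("append rate (msg/s):", info["append_events_per_second"])
print("read rate (msg/s):  ", info["read_events_per_second"])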
JVM Heap size
GRAYLOG_SERVER_JAVA_OPTS="-Xms15g -Xmx15g"
ES node hardware (for every node)
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
ES config (elasticsearch.yml, for every node)
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: graylog
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: es-data01.mykaarma.com
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
node.master: false
node.data: true
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 0.0.0.0
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: ["10.0.15.13", "10.0.15.132", "10.0.15.133", "10.0.15.134", "10.0.15.135", "10.0.15.136", "10.0.15.137"]
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 2
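# In our case that is 3 master-eligible nodes / 2 + 1 = 2.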
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
#bootstrap.mlockall: true
#indices.store.throttle.max_bytes_per_sec: 1024mb
JVM Heap size
-Xms28g
-Xmx28g
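(For what it's worth: 28 GB stays under the ~32 GB compressed-oops cutoff, and per the "half the memory available" comment in the config above it assumes roughly 56 GB of RAM on each data node.)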
Also, this is the Message Processors configuration:
Please let me know if I am missing any configuration.
A quick reply will be appreciated.
Thanks in advance.