Backlog on GL Nodes

Hi,

For a few days now we have had a backlog of more than 10 million messages on 3 Graylog nodes. We have tried everything but haven't found a solution, and it has become a daily problem. The incoming message count is much higher than the outgoing count. Please help me with this.

But I noticed one thing:

  • The output message rate to Elasticsearch increases when we pause our running streams, and the backlog then starts decreasing and comes back to normal within an hour. We have gone through the rules (including the regular expressions) defined in the streams and all of them are working fine. Directly or indirectly, the streams may be responsible for the backlog on the GL nodes.

In my environment I am using 7 Graylog nodes, 3 ES master nodes, and 4 ES data nodes.

Graylog Version - 2.4
Elasticsearch Version - 5.6

All 7 GL nodes have the same configuration.

Hardware Configuration (for every GL node)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
              total        used        free      shared  buff/cache   available
Mem:            29G         23G        210M        1.4G        5.8G        4.1G
Swap:            0B 

server.conf (for every GL node)

output_batch_size = 5000

# Flush interval (in seconds) for the Elasticsearch output. This is the maximum amount of time between two
# batches of messages written to Elasticsearch. It is only effective at all if your minimum number of messages
# for this time period is less than output_batch_size * outputbuffer_processors.
output_flush_interval = 1
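# With the values in this file, output_batch_size * outputbuffer_processors = 5000 * 16 = 80,000 messages per flush interval (outputbuffer_processors is set further down).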

# As stream outputs are loaded only on demand, an output which is failing to initialize will be tried over and
# over again. To prevent this, the following configuration options define after how many faults an output will
# not be tried again for an also configurable amount of seconds.
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

# The number of parallel running processors.
# Raise this number if your buffers are filling up.
processbuffer_processors = 16
outputbuffer_processors = 16

# The following settings (outputbuffer_processor_*) configure the thread pools backing each output buffer processor.
# See https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/ThreadPoolExecutor.html for technical details

# When the number of threads is greater than the core (see outputbuffer_processor_threads_core_pool_size),
# this is the maximum time in milliseconds that excess idle threads will wait for new tasks before terminating.
# Default: 5000
#outputbuffer_processor_keep_alive_time = 5000

# The number of threads to keep in the pool, even if they are idle, unless allowCoreThreadTimeOut is set
# Default: 3
#outputbuffer_processor_threads_core_pool_size = 3

# The maximum number of threads to allow in the pool
# Default: 30
#outputbuffer_processor_threads_max_pool_size = 30

# UDP receive buffer size for all message inputs (e. g. SyslogUDPInput).
#udp_recvbuffer_sizes = 1048576

# Wait strategy describing how buffer processors wait on a cursor sequence. (default: sleeping)
# Possible types:
#  - yielding
#     Compromise between performance and CPU usage.
#  - sleeping
#     Compromise between performance and CPU usage. Latency spikes can occur after quiet periods.
#  - blocking
#     High throughput, low latency, higher CPU usage.
#  - busy_spinning
#     Avoids syscalls which could introduce latency jitter. Best when threads can be bound to specific CPU cores.
processor_wait_strategy = blocking

# Size of internal ring buffers. Raise this if raising outputbuffer_processors does not help anymore.
# For optimum performance your LogMessage objects in the ring buffer should fit in your CPU L3 cache.
# Must be a power of 2. (512, 1024, 2048, ...)
ring_size = 262144

inputbuffer_ring_size = 262144
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

# Enable the disk based message journal.
message_journal_enabled = true

JVM Heap size

GRAYLOG_SERVER_JAVA_OPTS="-Xms15g -Xmx15g"

ES node hardware (for every ES node)

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8

ES configs (for every ES node)

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
cluster.name: graylog
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
node.name: es-data01.mykaarma.com
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#

node.master: false
node.data: true
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
network.host: 0.0.0.0
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
discovery.zen.ping.unicast.hosts: ["10.0.15.13", "10.0.15.132", "10.0.15.133", "10.0.15.134", "10.0.15.135", "10.0.15.136", "10.0.15.137"]
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
discovery.zen.minimum_master_nodes: 2
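# With the 3 master-eligible nodes in this environment, that is 3 / 2 + 1 = 2.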
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
#bootstrap.mlockall: true
#indices.store.throttle.max_bytes_per_sec: 1024mb

JVM Heap size

-Xms28g
-Xmx28g

Also, this is the Message Processors configuration:

Please let me know if I am missing any configuration.

A quick reply will be appreciated.

Thanks in advance.


What might be the reason for this?

Somehow the output message rate to Elasticsearch increases when we pause our running streams, and the backlog starts decreasing and comes back to normal within an hour. We have gone through the rules (including the regular expressions) defined in the streams and all of them are working fine.

All the rules you have in your streams run against each incoming message, so you might want to lower the complexity of the streams. Rethinking your processing might help with that.

Thanks for the reply, @jan.
Can you please let me know how we can lower the complexity of the streams?

Can you please let me know how we can lower the complexity of the streams?

As I do not know your stream rules, I can't give any recommendations on how to lower the complexity.

But @jan, are there any basic rules for lowering the complexity?

We have more than 120 streams running on Graylog, and most of the stream rules use the type "match regular expression".

Does the "match regular expression" type increase complexity?

For example:
If we use a regular expression, we define a single stream rule for the requirement below as:

Field: source
Type: match regular expression
Value: piht01.nvirg.aws.kaar-ma.com|piht02.nvirg.aws.kaar-ma.com|pitc02.nvirg.aws.kaar-ma.com|pitc01.nvirg.aws.kaar-ma.com

And if we don't use a regular expression, we have to define a separate stream rule for the same requirement:

Field: source
Type: match exactly
Value: piht01.nvirg.aws.kaar-ma.com
Value: piht02.nvirg.aws.kaar-ma.com
Value: pitc02.nvirg.aws.kaar-ma.com
Value: pitc01.nvirg.aws.kaar-ma.com

A new stream rule for every source ^^.
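As a sketch of how the single regex rule above could be made cheaper to evaluate (the pattern below is only an illustration built from the four example hostnames, not a tested rule), the alternation can be anchored and the common suffix factored out:

Field: source
Type: match regular expression
Value: ^(piht01|piht02|pitc02|pitc01)\.nvirg\.aws\.kaar-ma\.com$

The ^ and $ anchors and the escaped dots keep the pattern from being searched anywhere inside the field value, which is cheaper per message than an unanchored alternation of four full hostnames.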

It looks like your processing order is:

  1. GeoIP Resolver
  2. Pipeline Processor
  3. Message Filter Chain

Is that accurate? Is there a particular reason you’re going in that order? Are you using pipelines? Are you using message extractors? Are you only routing messages based on stream rules? Are you removing the messages from “All Messages” after you’ve routed them into your streams?

I’m processing with:

  1. Pipeline Processor
  2. Message Filter Chain
  3. GeoIP Resolver

Are you using the GeoIP processor? Do you mean to be processing for geo coordinates on all your messages first? That would seem like a HUGE amount of internal mmdb processing work being done before the messages even get to your streams. I use the GeoIP resolver last to ensure that only the messages that require that type of extra work are getting it at the end.

I use pipelines in an upside-down-pyramid fashion, with lower-numbered priorities executed first.

As an example:

  • -5 Priority - All messages run against a pipeline with rules designed to drop messages from specific hosts or containing specific information.
  • -4 Priority - All messages run against a pipeline with rules designed to add a field of prod, dev, itest, etc. based on the server name pattern.
  • -3 Priority - All messages run against a pipeline with rules designed to add a field with the application name based on certain requirements.
  • -2 Priority - I have pipelines and rules built to process application-specific log files in their own pipelines (e.g. when has_field(“applicationA”)), so that now I'm breaking apart the mass of messages into their actual application-specific pipelines.

At the top of the pyramid the pipelines are huge and run against all of the messages; as messages move down, they slowly trickle into smaller and faster pipelines that get more and more specific.
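To make the idea concrete, here is a rough sketch of what rules in the -5 and -4 stages might look like (the hostname, field names, and values are hypothetical examples, not my actual rules):

rule "drop messages from a noisy host"
when
    to_string($message.source) == "noisyhost01.example.com"
then
    drop_message();
end

rule "tag production messages"
when
    has_field("source") AND regex("^prod", to_string($message.source)).matches == true
then
    set_field("environment", "prod");
end

The first rule would sit in the -5 priority pipeline (drop unwanted messages before any further work is spent on them), the second in the -4 priority pipeline (add an environment field based on the server name pattern).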

What I used to have was large groups of pipelines processing messages all at the same priority level, which was killing me performance-wise. Having 30 pipelines with 4-5 rules per pipeline all processing the same messages simultaneously was not smart.

I'm not saying your processing order is wrong, but I think you really need to think about what is being processed and in what order, and about the most effective way to run your pipelines / rules / streams. You may want to whiteboard it so that you can visually see the larger pool of messages, how you want to identify and separate them, and how to make them flow from large pools into smaller streams.

Cheers!


I also have to ask: are you using any type of metrics reporting for Graylog-specific metrics? I cannot recommend enough that you set up a Grafana / Prometheus / whatever instance and install the appropriate metrics reporter on all your Graylog nodes. It took me some time to get it all working, but this has been the absolute best insight into how my 3 GL nodes are performing. You can look at input buffers, process buffers and output buffers and really see where you might be getting stuck, or when massive message loads are coming in and where. Without a doubt, these tools have helped me improve performance by letting me see things that the normal GL console just doesn't show.
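As a minimal sketch, assuming each Graylog node exposes its metrics through a Prometheus-compatible metrics reporter (the hostnames and port below are placeholders, not values from my setup), the Prometheus side is just a scrape job pointing at every node:

scrape_configs:
  - job_name: 'graylog'
    scrape_interval: 15s
    static_configs:
      - targets: ['graylog01:9833', 'graylog02:9833', 'graylog03:9833']

Once the per-node input, process and output buffer metrics are in Prometheus, a Grafana dashboard on top of them shows exactly which buffer is filling up on which node and when.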

This is my environment for the last 12 hours.

When you're dealing with maxed-out buffers and millions of messages being back-processed, this is really useful stuff to know.

Again, Can’t. Recommend. Enough.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.