Sudden Spike of Unprocessed Messages in Graylog Cluster

Hi,

I am currently experiencing an issue related to a high number of unprocessed messages in my Graylog cluster.

Within a very short period of time (approximately 1 minute), the number of unprocessed messages can spike to hundreds of thousands, even though all nodes are in a RUNNING state and message processing is enabled.

Environment:

  • Graylog cluster (multi-node setup)

  • All nodes status: Running, Load Balancer Alive

  • JVM heap usage appears normal (approximately 1–4 GB per node)

  • Journal messages continue to accumulate

Issue:

  • Sudden spike in unprocessed messages

  • Processing throughput is unable to keep up with incoming log volume

  • Potential delay in log visibility and alerting

Questions:

  1. What are the most common root causes for this behavior?

  2. What is the recommended approach to identify the bottleneck (e.g., CPU, disk I/O, journal, or input rate)?

  3. What tuning steps are recommended to reduce the number of unprocessed messages?

  4. Would it be more appropriate to scale horizontally (add more nodes) or optimize the existing configuration first?


Check the process and output buffer utilization at the point the build-up occurs. If only the process buffer hits 100%, the issue is likely with your pipeline rules; if both the process and output buffers are full, the issue is with writing messages to the OpenSearch cluster.
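To make that decision rule concrete, here is a minimal sketch in Python. The thresholds and numbers are illustrative assumptions; in a real cluster you would read the utilization values from the System > Nodes page or the Graylog REST API rather than hard-coding them:

```python
# Sketch: classify the bottleneck from process/output buffer utilization.
# Values are fractions between 0.0 (empty) and 1.0 (full); the 0.95
# threshold is an assumption, not a Graylog default.

def classify_bottleneck(process_util: float, output_util: float) -> str:
    """Return a rough diagnosis from buffer utilization."""
    if process_util >= 0.95 and output_util >= 0.95:
        # Both full: Graylog cannot hand messages off to OpenSearch
        # fast enough -> indexing/output side is the bottleneck.
        return "opensearch-indexing"
    if process_util >= 0.95:
        # Only the process buffer is full: extractors/pipeline rules
        # on the Graylog node are too slow.
        return "pipeline-processing"
    return "no-buffer-bottleneck"

print(classify_bottleneck(1.0, 0.2))    # pipeline-processing
print(classify_bottleneck(0.98, 0.97))  # opensearch-indexing
```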

  1. One input is sending more logs than usual. You can check on the Inputs page which input is receiving the most messages.

  2. I recommend the steps by @Wine_Merchant to find the bottleneck: if only the process buffer is full, processing on the Graylog node is slow. If the output buffer is also full, OpenSearch is the bottleneck.

  3. If you can identify the messages that cause the spike, you can decide whether they are relevant for you or not. I know of cases where a drop_message() for certain types of messages brought a big performance boost, as the message is not even written into OpenSearch.

  4. First optimize what you already have. To do so:

  • Check your parsing. Greedy Grok patterns can really burn a lot of CPU. I wrote a blog post on that (in German, but any AI translation will help): Grok Pattern und Graylog: Effizientes Log-Parsing ohne Bottlenecks - NetUSE AG
  • Check your lookups. If you do e.g. reverse DNS and the timeout is 10 s, reduce the timeout. Also check the size of your caches: if the throughput of the cache is high but the hit percentage is low, increase your cache.
  • Check your alerts: running an alert every 5 seconds that searches a heavy data stream over the data of the last week will degrade the performance of your OpenSearch.
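To illustrate the drop_message() suggestion in point 3: a minimal Graylog pipeline rule that drops a noisy message type before it is processed further or written to OpenSearch. The field names and values here are made-up examples; adapt the condition to whatever identifies your spike.

```
rule "drop noisy debug messages"
when
  has_field("application_name") AND
  to_string($message.application_name) == "chatty-app" AND
  to_string($message.level) == "DEBUG"
then
  drop_message();
end
```

Attach the rule to a stage in a pipeline connected to the affected stream; dropped messages never reach the output buffer, so this relieves both processing and indexing load.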