Hourly Backlog Surge and 100% Buffer Maxed Issue in Graylog

oluseyeo · November 7, 2023, 1:38am

Hello everyone,

I’m writing to provide an important update on this ongoing issue. Over the weekend the issue recurred: our message backlog spiked to over 48 million messages while zero messages were being ingested. As before, my first attempt at a fix was a service reload, but this provided only temporary relief. Only a few hundred thousand messages were ingested and the issue persisted.

The eureka moment came when I decided to take a second look at the extractors. I deleted all extractors and then reloaded the Graylog service. This brought a dramatic improvement: the backlog cleared in under 10 minutes, and message processing and ingestion peaked at almost 54k msg/s, as indicated in the screenshot provided.

Now that it has been established that the extractors are the main culprit, and in a bid to permanently resolve this issue, I’m interested in exploring metrics that can help us assess the performance of each individual extractor. We have approximately 15 extractors in place, and understanding their impact on message processing will be crucial. Your insights and suggestions are highly valued.

drewmiranda-gl · November 7, 2023, 10:37pm

Wow great work and great find!

Regarding metrics within graylog, this page is helpful:
https://go2docs.graylog.org/5-2/interacting_with_your_log_data/metrics.html

You can browse the metrics of a node by going to the System / Nodes page and then clicking on the “Metrics” button for the applicable node.
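Once you have the per-extractor timer values from that Metrics page (or from a Prometheus export), a small script can rank the extractors by cost. The following is only a sketch: the metric prefix `org.graylog2.plugin.inputs.Extractor.` and the timer name `completeTime` are my assumptions about how extractor timers are named, so check them against what your node actually shows. The input is a plain dict of metric name to mean time that you build yourself; the function name `rank_extractors` is hypothetical.

```python
# Assumed naming scheme (verify on your node's Metrics page):
#   <prefix><extractor type>.<extractor id>.<timer name>
ASSUMED_PREFIX = "org.graylog2.plugin.inputs.Extractor."


def rank_extractors(mean_times, timer="completeTime", prefix=ASSUMED_PREFIX):
    """Return (extractor_type, extractor_id, mean_time) tuples, slowest first.

    mean_times maps a full metric name to its mean time, in whatever
    unit the source reports. Metrics that do not match the assumed
    prefix and timer suffix are ignored.
    """
    suffix = "." + timer
    ranked = []
    for name, mean in mean_times.items():
        if not name.startswith(prefix) or not name.endswith(suffix):
            continue
        # What remains between prefix and suffix is "<type>.<id>".
        middle = name[len(prefix):-len(suffix)]
        ext_type, _, ext_id = middle.partition(".")
        ranked.append((ext_type, ext_id, mean))
    return sorted(ranked, key=lambda row: row[2], reverse=True)
```

With around 15 extractors, the first one or two rows are usually the ones worth rewriting or removing first.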

You can also configure graylog to expose these metrics in a Prometheus exporter compatible format so that you can collect them with Prometheus and view them over time in Grafana. You can also specify additional metric mappings to be exposed. This is required since only a limited subset of graylog metrics is published by default.

When specifying metric_mappings via a custom mappings file you don’t need to specify the full metric path. You can specify a partial metric name:

For example, this metric:

is specified as

  - metric_name: "streams_incommingmsgs"
    match_pattern: "org.graylog2.plugin.streams.Stream"

in the custom mappings file.

Hope that helps.

oluseyeo · November 7, 2023, 10:56pm

Thank you, Drew, and everyone who has contributed.

ihe · November 13, 2023, 2:01pm

54k msg/s is impressive.
What kind of extractors are you using? Could it help with your computational bottleneck to change them into pipeline rules? With rules, you can decide which kind of message should be processed by e.g. a regex. Checking a computationally cheap condition for all messages, and running the computationally expensive regex only on the ones that pass, might be cheaper than running the regex on every message.
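To illustrate the idea outside of Graylog: the sketch below is plain Python, not pipeline rule syntax, and the pattern, the marker string and the function name are all made up for the example. It only shows the principle of putting a cheap substring check in front of an expensive regex.

```python
import re

# Hypothetical pattern and marker, for illustration only.
LOGIN_RE = re.compile(
    r"user=(?P<user>\S+)\s+src=(?P<src>\d{1,3}(?:\.\d{1,3}){3})"
)
MARKER = "user="


def extract_fields(message):
    """Run the regex only on messages that pass a cheap substring check."""
    if MARKER not in message:
        # Cheap guard: most messages leave here without touching the regex.
        return {}
    match = LOGIN_RE.search(message)
    return match.groupdict() if match else {}
```

How much this saves depends on how many messages the guard filters out and how expensive the regex is on non-matching input, so it is worth measuring before and after.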

oluseyeo · November 21, 2023, 1:28am

@ihe - thank you. The extractors are mostly regex based.
I will look into the possibility of converting these to pipeline rules and evaluate how much improvement that brings.

system · December 5, 2023, 1:29am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
