I’m writing with an important update on this ongoing issue. Over the weekend the problem recurred: our message backlog spiked to over 48 million messages, with zero messages being ingested. As before, my first attempt at a fix was a service reload, but this provided only temporary relief. Only a few hundred thousand messages were ingested, and the issue persisted.
The eureka moment came when I took a second look at the extractors. I deleted all of them and then reloaded the Graylog service. This brought a dramatic improvement: the backlog cleared in under 10 minutes, with message processing and ingestion peaking at almost 54k msg/s, as shown in the screenshot provided.
Now that the extractors have been established as the main culprit, I’d like to resolve this permanently by exploring metrics that can help us assess the performance of each individual extractor. We have approximately 15 extractors in place, and understanding their impact on message processing will be crucial. Your insights and suggestions are highly valued.
You can also configure Graylog to expose these metrics in a Prometheus exporter compatible format, so that you can collect them with Prometheus and view them over time in Grafana. You can also specify additional metric mappings to be exposed. This is required because only a limited subset of Graylog metrics is published by default.
When specifying metric_mappings via a custom mappings file you don’t need to specify the full metric path. You can specify a partial metric name:
54k msg/s is impressive
What kind of extractors are you using? Could it help with your computational bottleneck to change them into pipeline rules? With rules, you can decide which kind of message should be processed by, e.g., a regex. Checking a computationally cheap condition for all messages, and running the expensive regex only on those that pass, might be cheaper than running the regex on every message.
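To illustrate the idea outside of Graylog, here is a minimal Python sketch of a cheap guard in front of an expensive regex. This is not Graylog rule syntax; the pattern, the `user=` marker, and the function names are all made up for the example. How much it saves depends on your patterns and on what share of messages pass the guard.

```python
import re

# Hypothetical pattern and marker, chosen only for illustration.
EXPENSIVE = re.compile(
    r"user=(?P<user>\w+)\s+src=(?P<src>\d{1,3}(?:\.\d{1,3}){3})"
)
MARKER = "user="


def extract_always(message):
    """Extractor-style: run the regex against every message."""
    match = EXPENSIVE.search(message)
    return match.groupdict() if match else None


def extract_guarded(message):
    """Rule-style: cheap substring test first, regex only on candidates."""
    if MARKER not in message:
        return None  # most messages would exit here without touching the regex
    match = EXPENSIVE.search(message)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for msg in ("login user=alice src=10.0.0.5 ok", "heartbeat from node-3"):
        print(msg, "->", extract_guarded(msg))
```

Both functions return the same result for every message; the guarded version just skips the regex whenever the marker is absent.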
@ihe - thank you. The extractors are mostly regex-based.
I would look into the possibility of converting these to pipelines and evaluate how much improvement it brings.