Hello everyone,
I’m writing to provide an important update on this ongoing issue. Over the weekend, we had a similar repeat of the issue where we had significant spike in our message backlog, which reached over 48 million messages, with zero messages ingested. My initial attempts to address this issue as has always been, was a service reload, but this provided only temporary relief, as only a few hundreds of thousand of messages were successfully ingested while the issue persisted.
The eureka moment came when I decided to take a second look at the extractors. I swung in and deleted all extractors then reloaded the Graylog service. This brought a dramatic improvement, with the backlog being swiftly cleared in under 10minutes - message processing and ingestion peaked at almost 54k msg/s, as indicated in the screenshot provided.
Now that it has been established that the extractors are the biggest debacle, in a bid to permanently resolve this issue, I’m interested in exploring metrics that can help us assess the performance of each individual extractor. We have approximately 15 extractors in place, and understanding their impact on message processing will be crucial. Your insights and suggestions are highly valued.
