Hourly Backlog Surge and 100% Buffer Maxed Issue in Graylog

Dear Graylog Community,

I’m currently facing a challenge with Graylog that I’m hoping to get some assistance with. At the top of each hour, I’m experiencing a significant growth in backlog, with the backlog consistently reaching around 400,000 messages. During this period, the process buffer is maxed out at 100%.

What’s particularly interesting is that it takes approximately 9 minutes for all these backlog messages to be consumed. The good news is that once these messages are consumed, the system is free for the next 51minutes. Occasionally, the backlog may grow into as much as 10million messages, and in such very few instances, we reload the graylog service, and it all gets consumed within 30mins without suffering any data loss.

While this sounds okay, I am keen to understand the root cause and explore potential optimizations.

In addition to the above, kindly see the below information about our environment.

  • OS Information: Our Graylog environment runs on Debian - Ubuntu 22.04 with a total of 16 CPU cores / r5.8xlarge AWS instance.

  • Package Version:

elasticsearch                      7.15.2 
graylog-server                     5.0.8-1
mongodb-org                        6.0.7

  • ** Configurations Variables:** This is the only configuration enabled within our Graylog environment:
processbuffer_processors = 10
outputbuffer_processors = 4

I would greatly appreciate any insights or guidance from the community on how to address this issue and ensure smoother operations. Thank you in advance for your assistance!

Okay couple of questions:

  • What RAM do you have assigned to the Graylog jvm?
  • I’m assuming you are running elastic on the same machine. (Just to note you are running an version of elastic not supported by Graylog, so be careful with that one)
  • what is the total volume of messages per day you are putting through this cluster (what’s the GB/day from the overview page)
  • obvious question, but you dont have any snapshots etc running on a schedule on that machine?
  • what does your processing look like, are you running lots of pipelines or extractors on the messages?

Thank you for your prompt response and insightful questions @Joel_Duffield. Here are the details you requested:

  1. Heap Memory Allocation: We have allocated 20GB for the heap memory. Our current settings are as follows:
# Default Java options for heap and garbage collection.
GRAYLOG_SERVER_JAVA_OPTS="-Xms20g -Xmx20g -server -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow"
  1. You correctly observed that we have all components running on the same instance. I was initially under the impression that we were using compatible Elasticsearch (ES) versions, but now I’d like to confirm. Could you please provide guidance on the recommended Graylog and ES version compatibility?

  2. Our data volume averages about 60GB daily, as demonstrated in the graph below for the past 30 days.

  3. Not at the moment.

  4. Although we have a couple of extractors per stream, they are not time-bound. Consequently, I excluded them as potential reasons for the observed hourly spikes.

I appreciate your assistance and look forward to any further guidance or insights you can provide.

HI @oluseyeo
I suspect you have a spike in ingested logs every full hour. Can you very carefully monitor the values in the top right and tell us the in/out values just at the full hour when the backlog is building up quickly, during it is processed and in the remaining 51 minutes?
Next step would be to find out, which stream is responsible for this behavior.

Thank you for the suggestion, @ihe

I have conducted a thorough analysis through my day, and here are my observations:

  1. Within the Initial 7 Minutes: During this period, the outgoing message count drops to low double digits, with the highest recorded at 40. Incoming log messages, on the other hand, remain consistent, occasionally experiencing minor spikes.

  2. Immediately After the Initial 7 Minutes : For the next 2 minutes after, outgoing message to ES comes back alive, with peak numbers reaching up to 10,000 messages, with the backlog wholly consumed. However, incoming log messages maintain their normal incoming rate as earlier observed within this period.

  3. Backlog Consumed : Over the next 51 minutes, log messages are written to ES almost immediately upon arrival. There are minor spikes in the backlog and process buffer during this period, but they seldom exceed 25% at most.

In additon, I checked the streams during this period and did not identify any streams that appeared to be generating excessive logs. If there is any additional information or specific data that you believe would be beneficial in diagnosing the issue, please do not hesitate to let me know.

@ihe @Joel_Duffield - Any chance to do a review of this further?

interesting. This looks like a bottleneck in the processing/storage during the initial 7 minutes.
There are three buffers: input, processing and output. only the processingbuffer fills up, or also the output buffer?

Only process buffer gets filled, takes a maximum of 65k messages.

There is no response buffer. Click one of your node for details. There you should see this:

Which one fills up? Outbound AND process: problem with elastic/opensearch. Process only: problem with Graylog.

That was an oversight. @ihe Most definitely, the process buffer is what gets filled. Other buffers have not been maxed out.

great, so we know your processing in Graylog lacks some speed in this time. That is interesting.
Do you do parsing, and if yes, in extractors or in Grok patterns? I could imagine some computational expensive timestamps in your logs, which are only difficult to parse at the start of the hour. See also my favorite link about regexploit to find out about this problem.

Yes, the problem points to Graylog for the most part.
The parsing available is only done in the extractors, and there are just a few of them.
Thank you for the link, I would be certain to check it out.

The part that is really giving me pause is how slow it gets and then how fast it recovers. I would assume (and i have seen in real life) processing slowdowns due to a certain extractor etc, but they would normally maintain the same slow throughput all the time. In other words they can only process 500 messages a minute, and anytime the input goes above that the backlog builds until the input drops below 500 and it can catch up. In your cause its almost like everything pauses and then just unpauses.

more ideas:

  • do you have any cronjobs on your graylog-machine which eat up the performance? Or is it a virtual machine and one of the other guest on the same hardware has some hard cronjobs every hour?
  • or do you have an hourly rotation of you indices, and your Graylog is very occupied by rotating them?

You echo my concerns @Joel_Duffield hence why I had raised this topic on the community channel.

@ihe - This is a VM hosted on AWS, and there are no hourly cron jobs running on this node.
The rotation strategy for the indices is hinged on the index size which is set at 30GB. I wouldn’t expect that we max out 30GB every hour, thus the likelihood of this being the reason is slim.

AWS - that is an interesting point and brings up a new idea:

  • what instance type do you use? T2/T3 (maybe others as well) have some fancy “burst” rates and a average baseline performance. Per hour you only can do do and so many minutes of burst, and then they become slow.

Another question about your machine: in the time when it is so slow, is the cpu saturated to 100%?

Can you share metrics for the following?

  • Device load (e.g. load 1, load 5)
  • gl_throughput_input_1_sec_rate
  • gl_throughput_output_1_sec_rate

I noticed in the outgoing screenshot it does show ~60GB/day. However, in another screenshot the 1 second message rate is 15,957/sec, which assuming an average message size of 800 bytes, would be 1.1TB/day. The behavior you describe does sound like message surges.

Do you have any drop rules (rules that allow graylog to receive messages but ultimately discard them?) I ask beacuse those messages still count against all of the input/processing metrics, although I would expect dropped messages to not cause a bottle neck. In any case 1.1TB/day is 18 times more than 60GB/day and if there is a surge of that volume once per hour it makes sense that it would cause a backlog.

@ihe the instance class is the R5

We have observed intermittent spikes in our log traffic during the day. Occasionally, there are restarts of components in our Production environment, leading to increased log generation. In general, however, we maintain a stable and relatively low volume of log traffic throughout the day. To clarify, we have not configured any drop rules.

In response to your request for specific throughput metrics, the exact metrics you mentioned are not available in our new Graylog metrics. To provide you with relevant data, we’ve calculated our daily throughput using the following query: increase(gl_input_throughput{instance="host"}[1d]). The resulting metric is displayed in the attached image below.

Please let me know if you require any further information or have additional questions

Additionally, for your reference, @drewmiranda-gl please see the Device Load metrics

Thank you for those! :pray:

Are you able to zoom in and show a bit more detail for an example time period where the issue is occurring? I’m concerned the chart isn’t showing granular enough data and spikes or dips may be obfuscated.

I’m also curious if you can observe any spikes in your graylog log messages when zoomed in, for example viewing a specific time range of 1 hour. For example, here is an exmaple surge from my graylog environment showing a 1 hour time period:

I’m curious if there is anything similar in your environment.

Are you using the elasticsearch_exporter? That can also provide some insight into the health and behavior of your elasticsearch cluster. What i would look for is to see what the charts look like at the time of (and immediately before/after) the issue occurs to see if there are any correlations. Items that could be helpful: