Slower processing of messages after upgrade to Graylog 6.x

1. Describe your incident:
A few weeks ago I upgraded from version 5.2 to 6.x. Everything else stayed unchanged: the Graylog configuration, MongoDB, OpenSearch, and the volume of incoming messages is the same. Immediately after the upgrade I noticed that the 3 Graylog servers are processing with somewhat lower efficiency; they are unable to keep up with the full traffic of around 50 000 msgs/sec, and the journal queues grow very large, to over 100 million messages. I don't see any related errors in the Graylog log files.
During the night the message volume is lower, and at that time the journal queues are empty; processing keeps up easily.

2. Describe your environment:

  • OS Information:
    Oracle Linux 8.7
  • Package Version:
    Before the upgrade: Graylog 5.2.3; after the upgrade: Graylog 6.0.2
  • Service logs, configurations, and environment variables:
    1 load balancer in front of 3 powerful Graylog servers, each with over 50 CPUs,
    6 OpenSearch nodes

3. What steps have you already taken to try and solve the problem?

I played a bit with the buffer settings, as I saw this was changed in the documentation for 6.0. But as I understand it, if a user sets his own values for processbuffer_processors and outputbuffer_processors, they should still take precedence and work as before?

Automatically choose default number of process-buffer and output-buffer processors based on available CPU cores. graylog2-server#17450, graylog2-server#17737

Our settings were already large before the upgrade, chosen based on the best results with previous versions:
processbuffer_processors = 26
outputbuffer_processors = 16
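
If I read that change correctly (this is only my assumption, I have not confirmed it anywhere), the auto-sizing should only apply to options that are left unset in server.conf, so the two variants would look roughly like this:

    # keep our explicit values (what we run today); these should take precedence over the auto-default
    processbuffer_processors = 26
    outputbuffer_processors = 16

    # ...or remove/comment the lines entirely so that 6.0 picks defaults based on the available CPU cores
    #processbuffer_processors =
    #outputbuffer_processors =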

I played a bit with these values, tried to lower and raise them, but it seems it didn't matter much. I monitored the process-buffer metrics in the GUI, such as [org.graylog2.shared.buffers.processors.ProcessBufferProcessor.incomingMessages], to see whether the changed buffer settings produce better throughput.
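
To watch that metric over time without clicking around in the GUI, a small polling script like the sketch below can help. It assumes the standard Graylog REST metrics endpoint (/api/system/metrics/<name>); the node URL and credentials are placeholders, and each Graylog node should be polled directly rather than through the load balancer, since the metric is per node.

    import time
    import requests

    # Hypothetical node address and credentials -- adjust for your environment.
    NODE_API = "https://graylog-node1.example.org:9000/api"
    AUTH = ("admin", "changeme")
    METRIC = "org.graylog2.shared.buffers.processors.ProcessBufferProcessor.incomingMessages"

    while True:
        resp = requests.get(
            f"{NODE_API}/system/metrics/{METRIC}",
            auth=AUTH,
            headers={"Accept": "application/json"},
        )
        resp.raise_for_status()
        # A meter metric reports a total count plus mean/1/5/15-minute rates;
        # printing the raw JSON avoids depending on exact field names.
        print(time.strftime("%H:%M:%S"), resp.json())
        time.sleep(60)

Logging the 1-minute rate per node before and after each buffer change is essentially what I was trying to do by hand in the GUI.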

The workaround for now is the following: once the journal queue on one Graylog node gets too large, we instruct the load balancer to move messages to the other 2 nodes and give that node time to recover.

4. How can the community help?

I would really prefer to stay on the 6.x version. Am I looking at the right cause with the buffers, and checking the right metrics?


How do the process and output buffers look on each of those nodes?

Process buffer values are higher, at 100% most of the time, but looking at them right now they do drop to around 70% here and there. The output buffers are different, changing more actively every few seconds, from 30% up to sometimes 100%.

Process buffer
65537 messages in process buffer, 100.00% utilized.

Output buffer
58150 messages in output buffer, 88.73% utilized.

Example of metrics from the process buffer:

[org.graylog2.shared.buffers.processors.ProcessBufferProcessor.incomingMessages]

Meter
Total: 202,565,015 events
Mean: 9,943.17 events/second
1 minute avg: 14,042.31 events/second
5 minute avg: 13,709.7 events/second
15 minute avg: 13,676.15 events/second

I don't remember looking at these metrics in this much detail before, but prior to the upgrade it was clear that each Graylog server could process at least 25 000 events per second on its own. Now, watching these metrics after the upgrade, I have never seen a figure higher than 18 000 events/sec.

Another test I did: I stopped processing (sending to OpenSearch) on the 2 other nodes and let just one Graylog node send messages to OpenSearch. The mentioned metrics did not improve; they stayed in the 15 000 - 18 000 range.


Ideally all of those buffers should be near zero most of the time, so you are definitely pushing that cluster to the limit.

We have generally seen an increase in performance from customers moving from 5 to 6, so it is strange to see a decrease.

Graylog is built to scale horizontally rather than vertically, so we would generally recommend more, smaller servers (e.g. 16 cores / 16 GB RAM) rather than a few beefy ones. It's not that scaling up can't help, but it's not really architected or tested for that, if that makes sense.

What kind of processing are you doing: pipelines, extractors, regex, etc.?