Process buffer at 100%: how can I optimize?

Hi to all,

I’m using Graylog to collect logs from network devices. In particular, at the moment I’m collecting debug logs from some new access points we have installed, to diagnose some issues.
I receive around 2,500 incoming messages per second during the morning hours (08:00-13:30), where most of the usage is, and for now I’ve had to extend the journal to 10 GB to avoid losing any data (I reach 80-90% journal utilization around 13:00).
Looking at the node statistics, the input and output buffers are at 0% while the process buffer sits at a full 100% for many hours.

Reading other threads on this topic, I’ve tried tuning some of the parameters, but I haven’t managed to boost processing enough to keep up.

Here’s the current situation:

Graylog is running on a VM together with Elasticsearch and MongoDB.
I’m using Graylog 5.0.4, Elasticsearch 7.10.2 and MongoDB 5.0.15 on Ubuntu Server 22.04.2.
The VM has 8 virtual cores and 24 GB of RAM.
I’ve assigned 8 GB of heap to Elasticsearch and 8 GB to Graylog via the JVM parameters.
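For reference, that just means the heap flags in /etc/default/graylog-server look roughly like this (other options left as the package shipped them; the Elasticsearch heap is set separately in /etc/elasticsearch/jvm.options):

GRAYLOG_SERVER_JAVA_OPTS="-Xms8g -Xmx8g"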
As for the Graylog process, I have this configuration:

output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 6
outputbuffer_processors = 1
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 1
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 10gb

I have tried to balance the processor counts against my number of cores, as suggested, giving the most to processbuffer_processors (6 + 1 + 1 for my 8 cores).

I only have one pipeline processing the messages, with this rule (I haven’t found a better way):

rule "filter extreme ap messages"
when
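    // the third argument (true) makes contains() case-insensitive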
    contains(to_string($message.message),"ah_auth: aaa: pmksa_cache_auth_add",true) 
    || contains(to_string($message.message),"amrp2: l2routing: set proxy route",true) 
    || contains(to_string($message.message),"kernel: [mesh]: set proxy",true) 
    || $message.message == "last message repeated 2 times"
then
  drop_message();
end

At the moment I can see a current throughput of 1,218 msg/s in the pipeline.
In the journal section I have something like this:

**2,384 messages** have been appended in the last second, **1,674 messages** have been read in the last second.

I’ve called the processbufferdump API many times and I have never seen all the processors full of messages (usually only 2-3 are in use).

Looking at the CPU usage, I can see it is not fully utilized (around 25-40% usage on all 8 cores), as you can see in this screenshot.

So, looking at this data, it seems to me that the processors are not being fed as much data as they should be, and I don’t understand why.
Any suggestion on what I could try to tweak? Or a way to better understand what’s happening?

Thanks to all
Cheers
Mix

All the other tuning aside, you can definitely optimize what you are doing in that single rule. I am not sure if rules are smart enough to stop evaluating on the first OR match, but let’s assume they don’t, for the worst case. If you think of contains() as a regex call, that is four regex-like scans on each message going through… and the scan is not anchored, so it slides along the entire message looking for a match in each case. That means each contains() would itself iterate some number of times to match (or not) before the next contains() gets to look at the message.

Since rules run semi-parallel inside each stage, you ~could~ optimize by making a separate rule for each contains() → drop_message() and putting them in sequential rules, with the most likely match happening first.
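Something like this for whichever pattern shows up most often, just as a sketch reusing one of your existing strings:

rule "drop ap pmksa_cache_auth_add"
when
    contains(to_string($message.message), "ah_auth: aaa: pmksa_cache_auth_add", true)
then
  drop_message();
end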

theoretically…


Thanks for the answer @tmacgbay :smiley:

I’ve tried splitting the rule into 4 like you suggested, and I’ve even tried switching from contains() to starts_with(), which I hope performs better.
Something like this:

rule "filter extreme ap pmksa_cache_auth_add"
when
    starts_with(to_string($message.message),"ah_auth: aaa: pmksa_cache_auth_add")
then
  drop_message();
end

I’ve set the “Continue processing on next stage when” option to “At least one of the rules on this stage matches the message”, which I think is what needs to be set to get the “short circuit” behaviour across the rules.

Here’s the result:

I don’t quite understand how the throughput on a single rule works… but in general I’m not seeing a large improvement in the total throughput (it’s still around 1,500 msg/s).

Is there any parameter I can tweak to control how many messages are read from the journal?
How can I tell whether the bottleneck is disk usage?

Thanks again
Daniele

You would definitely want to do more tuning for memory/processors, I am not as good as @gsmith is on that… I would literally look back through other community posts to figure that out… it is definitely out there, @gsmith answered one recently.

For the rules break-out: stages pretty much run all the rules in the stage at the same time… so your four rules in the same stage all run semi-parallel and do not pare down the total of what is running. Instead, you will want to put each rule in a sequential stage (stage 0, stage 1, stage 2…) with continue processing set to “None or more rules on this stage match”, so that the message always progresses to the next stage. There was a change a while back that made drop_message() stop processing subsequent stages when it is used. So if the first rule in stage 0 finds a match and drops the message, the message does not have to run through any of the later stages.
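As a rough sketch, the pipeline source for that layout would look something like this (the rule names after the first are just placeholders for your other three rules, and I’m going from memory that pass is the source keyword for the “None or more” setting on 5.x, so double-check it in the pipeline editor):

pipeline "Drop noisy AP messages"
stage 0 match pass
  rule "filter extreme ap pmksa_cache_auth_add";
stage 1 match pass
  rule "filter extreme ap proxy route";
stage 2 match pass
  rule "filter extreme ap mesh set proxy";
stage 3 match pass
  rule "filter last message repeated";
end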

All this will only help so much until you optimize the Java memory and processing settings and test for other bottlenecks. You did mention that everything is running on a single machine… if you separate Elasticsearch out (you should consider migrating to OpenSearch) onto its own machine, you can better judge the disk load… and since your point is that it is a process buffer issue… are there any other rules you have that may be work intensive? Rules with regex/GROK can really slow down processing if they aren’t efficient with what they are doing.
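For your disk question, a quick check you can run on the Graylog box while the journal is filling up (assuming the standard sysstat package on Ubuntu; watch whichever device holds the journal and the ES data):

sudo apt install sysstat    # if iostat is not already there
iostat -x 5                 # watch %util and await on the journal/ES disk
# a device pinned near 100% util with high await while the process buffer is full points at storage rather than CPU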


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.