Hi to all,
I’m using Graylog to collect logs from network devices. In particular, at the moment I’m collecting debug logs from some new access points we have installed, to diagnose some issues.
I receive around 2,500 incoming messages per second during the morning hours (08:00-13:30), where most of the usage is, and I had to extend the journal to 10 GB to avoid losing any data (I reach 80-90% journal utilization around 13:00).
Looking at the node statistics, the input and output buffers sit at 0% while the process buffer stays at 100% for many hours.
Reading other threads on this subject, I’ve tried tuning some of the parameters, but I haven’t managed to speed up processing.
Here’s the current situation:
Graylog is running on a VM together with Elasticsearch and MongoDB.
I’m using Graylog 5.0.4, Elasticsearch 7.10.2 and MongoDB 5.0.15 on Ubuntu Server 22.04.2.
The VM has 8 virtual cores and 24 GB of RAM.
I’ve assigned 8 GB of heap to Elasticsearch and 8 GB to Graylog via the JVM parameters.
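For reference, this is roughly how the heaps are set (the file paths are the ones used by the standard Ubuntu packages, the heap.options file name is just what I chose, and only the heap flags are shown — the other default JVM options are left untouched):

# /etc/default/graylog-server
GRAYLOG_SERVER_JAVA_OPTS="-Xms8g -Xmx8g"

# /etc/elasticsearch/jvm.options.d/heap.options
-Xms8g
-Xmx8g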
Regarding the Graylog processing settings, this is the current configuration:
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 6
outputbuffer_processors = 1
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 1
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 10gb
I have tried to balance the processor counts against my number of cores, as suggested elsewhere, giving most of them to processbuffer_processors (6 process + 1 output + 1 input = 8, matching the 8 cores).
I only have one pipeline processing the messages, with this rule (I haven’t found a better way):
rule "filter extreme ap messages"
when
contains(to_string($message.message),"ah_auth: aaa: pmksa_cache_auth_add",true)
|| contains(to_string($message.message),"amrp2: l2routing: set proxy route",true)
|| contains(to_string($message.message),"kernel: [mesh]: set proxy",true)
|| $message.message == "last message repeated 2 times"
then
drop_message();
end
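For completeness, the rule is attached to a single-stage pipeline whose source looks roughly like this (the pipeline name and the "match either" setting are just how I happened to set it up in the UI):

pipeline "Drop AP debug noise"
stage 0 match either
    rule "filter extreme ap messages";
end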
At the moment I can see a current throughput of 1,218 msg/s in the pipeline.
In the journal section I have something like this:
**2,384 messages** have been appended in the last second, **1,674 messages** have been read in the last second.
I’ve called the processbufferdump API many times and I have never seen all the processors full of messages (usually only 2-3 are in use).
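For reference, this is roughly the call I’ve been using (URL, credentials and the node ID placeholder are from my setup, and I’m assuming the per-node cluster processbufferdump endpoint — adjust if your version exposes it elsewhere):

curl -u admin:password \
  -H "Accept: application/json" -H "X-Requested-By: cli" \
  "http://127.0.0.1:9000/api/cluster/<node-id>/processbufferdump"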
Looking at the CPU usage, the machine is not fully utilized (around 25-40% usage on all 8 cores), as you can see in this screenshot.
So, looking at this data, it seems to me that the processors are not being fed as much data as they should be, and I don’t understand why.
Any suggestions on what I could try to tweak? Or a way to better understand what’s happening?
Thanks to all
Cheers
Mix