Graylog-server container starts consuming high CPU and auto-recovers in the evening

We are using Graylog 3.3.15, running via docker-compose. Since last month, every Tuesday at 5:30 AM we see high CPU usage on the server, which recovers on its own in the evening. This happens only on Tuesdays.

During the high CPU usage we also see the two notifications below in the Graylog UI:

  1. Journal utilization is too high and may go over the limit soon. Please verify that your Elasticsearch cluster is healthy and fast enough. You may also want to review your Graylog journal settings and set a higher limit

  2. [Journal utilization is too high - Uncommitted messages deleted from journal]

We have both Graylog and Elasticsearch running via docker-compose on the same node.

Hello && Welcome @kcgochar

I might be able to shed some light on this issue.

These are common notifications. They could indicate that every Tuesday at 5:30 AM this node is receiving more logs than normal and Graylog cannot process them quickly enough, so the journal fills up, which can cause further issues. This may be indicating a resource problem.
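
One way to confirm the journal is the bottleneck is to watch its utilization during the spike. Assuming the 3.x REST API exposes this under system/journal (host, port, and credentials below are placeholders), something like:

# Poll journal status during the Tuesday spike
curl -s -u admin:yourpassword http://graylog.example.com:9000/api/system/journal

The same figures are visible in the UI under System / Nodes on the node's detail page.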

Most members increase the process buffers to resolve this, but the Graylog documentation suggests that the buffer processors (process, input, output) combined should not exceed the physical CPU cores on that node. This means if you have the following:

processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2

then you should have at least 10 CPU cores (5 + 3 + 2 = 10), since each of these settings spawns that many processing threads. I have also seen these settings on nodes with 4 CPUs, but when there is more data than Graylog can handle, issues arise from this.
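
For example, on an 8-core node you could scale the three values down so their sum leaves headroom for the OS and for Elasticsearch. A sketch only, not numbers tuned to your load:

processbuffer_processors = 4
outputbuffer_processors = 2
inputbuffer_processors = 1
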
If the journal gets too full, you could increase the journal size so Graylog does not start deleting uncommitted messages from the journal (the second notification above).

message_journal_max_size = 12gb
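
While you are in server.conf, note that the journal is also capped by age; assuming the Graylog 3.x defaults, both of these settings bound how much the journal can hold:

message_journal_max_age = 12h
message_journal_max_size = 5gb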

Hope that helps

This may also be an issue with the types of messages and/or how you are processing them during the peak times. Consider that an inefficient regex or GROK statement may be fine normally, but when a complicated or oddly formatted log comes in, it can spike the system trying to deal with it.
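
As a hypothetical illustration, nested quantifiers are a common culprit: the first pattern below can backtrack catastrophically on a long line that almost matches, while the second covers essentially the same prefixes with a single character class and stays linear:

# Backtracking-prone (hypothetical extractor pattern): nested quantifiers
^(\w+\s*)*: %{GREEDYDATA:message}
# Safer: one character class, no nesting
^[\w\s]*: %{GREEDYDATA:message}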


Thank you @tmacgbay. Could you please let me know how we can check whether incoming messages are being processed by a regex or GROK statement?

Thank you @gsmith for contributing…

We have an 8-core CPU and already have the settings below:
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2

We have increased the message_journal_max_size from 5 GB to 12 GB. Now we will check if that helps.

Also, can you please let us know how we can check which application stream is writing more logs at that same time each Tuesday.
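
One way to get that per-stream breakdown, assuming the legacy universal search API in 3.3 (host, credentials, and STREAMID below are placeholders), is to count messages per stream over the spike window and compare:

# Message count for one stream over the last hour; repeat per stream ID
curl -s -u admin:yourpassword 'http://graylog.example.com:9000/api/search/universal/relative?query=*&range=3600&limit=1&filter=streams:STREAMID'

In the UI, an aggregation widget grouped by stream over the Tuesday time range should show the same breakdown.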

If YOU are applying regex or GROK to a message in an extractor or in the processing pipeline, then you can check those for efficiency. If you post a sample message that you are processing through regex/GROK, along with the associated regex/GROK, someone here can examine it and make suggestions…
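
To sanity-check a suspect pattern's cost, even a rough shell timing against captured sample lines can expose catastrophic backtracking (GNU grep with PCRE; the file name is an example):

# Compare wall time of a suspect pattern vs. a simplified one on real log lines
time grep -cP '^(\w+\s*)*:' sample.log
time grep -cP '^[\w\s]*:' sample.log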

Hello @kcgochar

If you can, I would suggest adding 2 more CPU cores. The total would be 10, which matches those settings, plus you need some CPU cores for your OS. Just a thought.
This would depend on how many logs are being ingested. If you think about it, all 8 cores are shared by Graylog and Elasticsearch, so if there is a spike at 5:30 AM this server might not have the resources it needs and is struggling.

Increasing the journal size is a safety measure, but if there are not enough resources to index those messages/logs, you will still see high CPU usage.
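
To see which container is actually consuming the CPU during the spike, the standard Docker CLI is enough (container names below are examples; use the names from your docker-compose.yml):

# One-shot CPU/memory snapshot of both containers
docker stats --no-stream graylog elasticsearch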
