We are running Graylog 3.3.15 via docker-compose. Since last month, every Tuesday at 5:30 AM we see high CPU usage on the server, which recovers on its own by the evening. This happens only on Tuesdays.
During the high CPU usage we also see the following two notifications in the Graylog UI:
Journal utilization is too high and may go over the limit soon. Please verify that your Elasticsearch cluster is healthy and fast enough. You may also want to review your Graylog journal settings and set a higher limit
[Journal utilization is too high - Uncommitted messages deleted from journal]
We have both Graylog and Elasticsearch running via docker-compose on the same node.
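For context, a single-node layout like that typically looks something like the sketch below. The image tags, ports, and heap size here are illustrative assumptions, not taken from the poster's actual compose file (Graylog 3.3 pairs with MongoDB and Elasticsearch 6.8):

```yaml
# Illustrative single-node docker-compose layout -- versions and values
# are assumptions for the sketch, not the poster's real configuration.
version: "3"
services:
  mongo:
    image: mongo:3.6
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.10
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"   # ES heap competes with Graylog for the same node's RAM/CPU
  graylog:
    image: graylog/graylog:3.3.15
    environment:
      - GRAYLOG_HTTP_EXTERNAL_URI=http://127.0.0.1:9000/
    depends_on:
      - mongo
      - elasticsearch
    ports:
      - "9000:9000"        # web UI / REST API
      - "12201:12201/udp"  # example GELF input
```

The relevant point for this thread: everything above shares one node's CPU cores, so a Tuesday-morning ingest spike hits Graylog's processing threads and Elasticsearch's indexing threads at the same time.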
These notifications commonly indicate that every Tuesday at 5:30 AM this node is receiving more logs than normal and Graylog cannot process them quickly enough, so the journal fills up, which can cause further issues. This may point to a resource problem.
Most members increase the process buffers to resolve this, but the Graylog documentation suggests that the buffer processors (process, input, output) combined should not exceed the physical CPU cores on that node. This means that if you have the following
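(The original settings were not included in the post; as a hypothetical illustration, buffer processor counts that sum to 10 would look like this in `graylog.conf` — these happen to be the shipped defaults:)

```ini
# Hypothetical graylog.conf buffer settings (the poster's actual values
# were not shown). Each setting spawns that many processing threads,
# so together these expect 10 CPU cores:
processbuffer_processors = 5
inputbuffer_processors = 2
outputbuffer_processors = 3
```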
Then you should have at least 10 CPU cores. Each processor setting creates that number of threads. I have also seen these settings on nodes with 4 CPUs, but when there is more data than Graylog can handle, issues arise from this.
If the journal gets too full, you could increase the journal size to prevent messages from being dropped while Elasticsearch catches up.
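The journal is controlled by these settings in `graylog.conf`. The values below are examples, not recommendations — size it against your actual Tuesday backlog and available disk:

```ini
# Journal settings in graylog.conf (example values, not recommendations):
message_journal_enabled = true
message_journal_max_size = 10gb   # default is 5gb
message_journal_max_age = 12h     # default; oldest segments are deleted past this
```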
This also may be an issue with the types of messages and/or how you are processing them during the peak times. Consider that it may be a regex or GROK statement that is inefficient: fine normally, but when a complicated or oddly formatted log comes in, it spikes the system trying to deal with it.
If you are applying regex or GROK to a message in an extractor or in the Processing Pipeline, you can check those for efficiency. If you post a sample message that you are processing through regex/GROK and the associated regex/GROK, someone here can examine it and make suggestions.
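To show what "fine normally, spikes on odd input" looks like, here is a small Python sketch with a hypothetical pattern (not from the poster's setup). Nested quantifiers like `(\w+\s*)+` backtrack exponentially when a line almost matches but fails at the end, which is exactly how a weekly batch of oddly formatted logs can peg a CPU core:

```python
import re
import time

# Hypothetical backtracking-prone pattern vs. a safer equivalent.
# Both accept the same strings, but the nested quantifier retries every
# way of splitting the input into groups before failing.
bad_pattern = re.compile(r"^(\w+\s*)+$")
good_pattern = re.compile(r"^[\w\s]+$")  # no nesting, fails fast

normal_log = "user logged in from host01"
odd_log = "a" * 20 + "!"  # almost matches, then fails on the last char

# Both patterns handle a normal line instantly.
assert bad_pattern.match(normal_log)
assert good_pattern.match(normal_log)

# The safe pattern rejects the malformed line immediately...
start = time.perf_counter()
assert good_pattern.match(odd_log) is None
fast = time.perf_counter() - start

# ...while the nested-quantifier pattern churns through roughly 2^19
# split attempts before giving up.
start = time.perf_counter()
assert bad_pattern.match(odd_log) is None
slow = time.perf_counter() - start

print(f"safe pattern: {fast:.6f}s, backtracking pattern: {slow:.6f}s")
```

Add one or two more characters to `odd_log` and the slow time doubles each step — a short spike like this on every bad line is enough to fill the journal during an ingest burst.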
If you can, I would suggest adding 2 more CPU cores. The total would be 10, which matches those settings, plus you need some CPU cores for the OS. Just a thought.
This would depend on how many logs are being ingested. If you think about it, all 8 cores are shared by Graylog and Elasticsearch, so if there is an ingest spike at 5:30 AM this server might not have the resources it needs, and it struggles.
Increasing the journal size is a safety configuration, but if there are not enough resources to index those messages/logs, you will still see high CPU usage.