1. Describe your incident:
The Graylog container has pegged two CPU cores of the server since yesterday. The MongoDB and OpenSearch containers are showing normal CPU usage.
2. Describe your environment:
OS Information: Debian 12
Package Version: 6.0.2 Docker
Service logs, configurations, and environment variables:
Single-server instance with everything running in Docker
3. What steps have you already taken to try and solve the problem?
Checked logs.
4. How can the community help?
How can I tell why Graylog has suddenly started using so much CPU?
Because Graylog had the two cores pinned, the VM backup failed and caused it to be shut down. I’ve restarted the VM and things appear to be back to normal.
I’m not sure if there was just a runaway task or something else. I wasn’t seeing a ton of activity on OpenSearch so I don’t think it was organizing the indices.
Each one of those settings creates a thread on the CPU. Depending on the amount of load you have coming in, you may want to increase your CPU to 4 cores, or adjust those settings to match the number of CPU cores you have.
So when it spikes, it's because OpenSearch/Graylog is using the CPU. The settings I showed above create the threads used for processing. The one that is the heavy hitter is:
processbuffer_processors = 5
If you add up those settings, they create 10 threads, and normally you want one CPU core per thread, so that would mean 10 cores. Under normal operation you can get away with those settings, but once there is load they will start using those resources, hence the CPU spike.
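For context, those three settings and their stock defaults look like this in the Graylog server config (the values below are the defaults, so double-check what your own config or environment variables actually say):
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2
# 5 + 3 + 2 = 10 processing threads, i.e. roughly 10 cores to be comfortable
On the Docker image I believe you can override these with environment variables like GRAYLOG_PROCESSBUFFER_PROCESSORS instead of editing the file.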
What do you define as a spike? There were no indications of any sort of spike, hence my confusion. The OpenSearch container wasn't using much CPU and the indices volume was showing its usual level of activity. The logs didn't show any sort of index operation being performed, nor was there an increase in the amount of logs or a change in processing pipelines. The node buffers weren't filling up, etc.
Except for the two pinned cores, there were no indications of anything out of the ordinary. That's why I think it was some sort of runaway process, but I don't know how to determine that for sure. It hasn't happened before or since.
I am using GROK patterns in my pipelines, but as I mentioned, none of those had changed recently. I've since added more patterns and pipelines and my CPU usage continues to be low.
OK, so nothing has changed with your GROK stuff and suddenly your two CPUs get pinned. The only variable I see that could have changed would be what was sent to your Graylog server, log-wise. Maybe some random log was sent to Graylog and got stuck in a pipeline, which pinned your CPUs. The reason I bring this up is that we had issues with GROK, so we reconfigured for regex instead and haven't had an issue since. Just an idea.
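Rough sketch of what I mean by swapping a pipeline rule from grok() to regex() (the rule name, pattern, and field names here are made up, adjust them to your own logs):
rule "extract source ip with regex"
when
  has_field("message")
then
  // regex() returns the capture groups in a map keyed "0", "1", ...
  let m = regex("from (\\d+\\.\\d+\\.\\d+\\.\\d+)", to_string($message.message));
  set_field("src_ip", m["0"]);
end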
Understood. Since it hasn't happened again, I'm not too concerned about what caused it. What I mainly want to know is how to diagnose it if it happens again. Figuring out what is causing load on Graylog is much less straightforward than I'd like.
What do you recommend for standard troubleshooting steps?
Checking the OpenSearch and/or Graylog logs is the only thing I can think of right now. Keep an eye on htop/top to find which service/app is consuming your CPU. A metrics server like Zabbix, or Grafana, does help. It's hard to find this type of issue. I would also look at your log shipper's logs to see if there was any clue.
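Something like this is roughly what I'd run the next time it happens (the ports and credentials below are just placeholders for a typical single-node Docker setup, adjust them to yours):
# confirm which container is actually eating the CPU
docker stats --no-stream
# hot threads on the host; container threads show up here too
top -H
# thread dump from the Graylog node, same data as the "Get thread dump"
# action on the System > Nodes page if I remember right
curl -s -u admin:yourpassword -H 'X-Requested-By: cli' http://localhost:9000/api/system/threaddump
# journal and buffer state, to see if messages are piling up
curl -s -u admin:yourpassword -H 'X-Requested-By: cli' http://localhost:9000/api/system/journal
# and on the OpenSearch side, what its threads are busy with
curl -s http://localhost:9200/_nodes/hot_threads
If the same processing thread keeps showing up stuck in the same grok/regex call across a couple of thread dumps, that's usually the smoking gun for a runaway pattern.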
That's what I did the first time, but unfortunately nothing stood out. I know it was Java consuming all the CPU in the Graylog container, but that was it.