I started having this problem a few months ago: for seemingly no reason, log message throughput tanks, CPU usage spikes, the Process Buffer fills up, and then the Journal begins filling up. Write-outs just stop for minutes at a time, for no reason I can see. There are so many metrics to sift through that I’m having trouble finding meaningful data in them.
I started out with a Graylog server that had been solid for over a year. We take in maybe 11GB of logs daily, and when I started having this issue the server specs were 4 vCPUs, 8GB RAM, and 2TB of disk space, all in AWS, running GL, Mongo, and ES on the same box. Since then I’ve changed the instance type to C5 (compute optimized), so it’s got 16 vCPUs and 32GB RAM, and I’ve moved Elasticsearch to its own separate cluster. I even made a new virtual server and installed GL3 fresh. I’ve updated the server.conf to use most of the CPUs for the Process Buffer, and while it’s currently maxing out all 16 CPUs, it’s not helping much, if at all. It still just sits there not writing out for seconds to minutes at a time.
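To put the volume in perspective, here is a quick back-of-the-envelope calculation of the average ingest rate. It is only a sketch: it assumes the 11GB figure means decimal gigabytes spread evenly over the day, which real traffic certainly is not, and the variable names are just for illustration.

```python
# Rough average ingest rate from the "maybe 11GB of logs daily" figure.
# Assumption: 1 GB = 10**9 bytes and traffic is spread evenly over 24 hours.
DAILY_GB = 11
SECONDS_PER_DAY = 24 * 60 * 60

avg_bytes_per_second = DAILY_GB * 10**9 / SECONDS_PER_DAY
avg_kb_per_second = avg_bytes_per_second / 1000

print(f"average ingest: about {avg_kb_per_second:.0f} kB/s")
```

Bursts will be well above the average, so this says nothing about peak load, only that the sustained volume is modest.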
As far as I can see, ES is mostly idle, with CPU around 5% at most. We didn’t preserve any old log data, so there’s only 1 index and it’s not very big yet.
What can I do to troubleshoot this? I have very few extractors; the most complicated are the nginx JSON extractors, and they’ve been working for months. I’ve even stopped all the Inputs for minutes at a time. Nothing seems to have any real effect on the rate at which things are written out to ES. The settings I’ve changed are:
output_batch_size = 10000
processbuffer_processors = 12
outputbuffer_processors = 2
mongodb_max_connections = 2000
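For what it’s worth, here is a rough sketch of how those processor settings add up against the 16 vCPUs. It assumes inputbuffer_processors is still at 2, which I believe is the default but which I have not set explicitly; the function and variable names are made up for illustration.

```python
# Settings copied from the list above. inputbuffer_processors is NOT in that
# list; the value 2 used below is an assumption about the default.
SERVER_CONF = """
output_batch_size = 10000
processbuffer_processors = 12
outputbuffer_processors = 2
mongodb_max_connections = 2000
"""

VCPUS = 16  # current C5 instance


def parse_conf(text):
    """Parse simple 'key = value' lines, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def buffer_threads(settings, assumed_inputbuffer=2):
    """Sum the three buffer processor counts, using the assumed input default."""
    return (
        int(settings.get("processbuffer_processors", 0))
        + int(settings.get("outputbuffer_processors", 0))
        + int(settings.get("inputbuffer_processors", assumed_inputbuffer))
    )


settings = parse_conf(SERVER_CONF)
total = buffer_threads(settings)
print(f"buffer processor threads: {total} of {VCPUS} vCPUs")
```

Under that assumption the three buffers together account for every vCPU on the box, which may or may not be related to the CPU being pegged.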
I have rather extensive pipelines to sort messages into streams and to make alerts more granular, but they’ve also all been working for months, and nothing substantially new was added leading up to this issue. Frankly, I don’t see any timing metrics for how long pipelines take to do their thing.
Thanks very much in advance for any help you can give me.
While I’ve been typing this, for no reason I can figure out, the Process Buffer has gone back to 0, and the journal has processed. At the moment CPU is “only” at ~600%, but at least it seems stable, and it’s no longer at 1400%, so I guess that’s better, but who knows how long till it does this again.