This just started happening yesterday afternoon. My Graylog server, which has been running almost flawlessly for over a year, has started freaking out. As a first attempt to fix the problem, we turned down log settings across our servers so that we only receive error-level messages and higher. That measurably decreased our overall log input from triple digits per second to double digits, but the server is still struggling.
Next we increased the virtual hardware by a tier, so it now has 32 GB of RAM and 8 CPUs, and I've set the Graylog and Elasticsearch heap sizes to take advantage of the extra RAM. Performance is somewhat better, but it's still struggling. The journal is starting to grow, and the process buffer is hovering around 60-65%. The only people logged in are admins, and none of us are running queries.
The most obvious thing I see is that the messages In/Out meter drops to 0 messages out and stays there for a minute or more at a time. Yet the ES server doesn't seem to be using many resources, other than disk writes.
The ES server is still running on the same VM as Graylog, and we're actively working on moving it to the AWS Elasticsearch service, but I want to make sure this isn't being caused by something else.
Stupid question, but did you make any changes or add any extractors? I recently ran into issues with an extractor that tested fine but filled up the output buffer because the pattern wasn't correct when applied to the actual messages coming in.
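For what it's worth, this is roughly the sanity check I do now before saving an extractor: run the regex against a handful of real messages from the input, not just the hand-written test string. A minimal Python sketch, with a made-up pattern and made-up sample messages:

```python
import re

# Hypothetical extractor pattern -- substitute the actual pattern from your extractor.
PATTERN = re.compile(r"user=(?P<user>\S+)\s+action=(?P<action>\S+)")

# Paste a handful of *real* messages from the affected input here,
# not just the test string you wrote the pattern against.
sample_messages = [
    "2019-04-02T14:03:11Z user=jdoe action=login src=10.0.0.5",
    "2019-04-02T14:03:12Z pid=4411 msg='scheduled health check ok'",  # no user/action fields
]

for msg in sample_messages:
    match = PATTERN.search(msg)
    status = "MATCH   " if match else "NO MATCH"
    print(f"{status}: {msg}")
    if match:
        print("  extracted:", match.groupdict())
```

If the pattern misses or behaves oddly on the real messages, it shows up here instead of after the extractor is chewing on the full stream.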
You can temporarily close indices; a closed index doesn't take up memory in ES.
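I believe you can also close indices from Graylog's System / Indices page; if you'd rather do it against ES directly, here is a minimal sketch assuming ES on localhost:9200 and a hypothetical index named graylog_42 (never close the active write index):

```python
import requests

ES = "http://localhost:9200"   # assumption: single-node ES reachable locally
INDEX = "graylog_42"           # hypothetical name -- pick an old index, never the active write index

# List indices first so you can see which ones are old and rarely searched.
print(requests.get(f"{ES}/_cat/indices?v").text)

# Close the index: data stays on disk, but it is no longer held in the ES heap.
resp = requests.post(f"{ES}/{INDEX}/_close")
resp.raise_for_status()
print(resp.json())

# Reopen later with:
# requests.post(f"{ES}/{INDEX}/_open")
```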
Also check your extractors and pipelines.
You can run pipeline simulations and look at the processing time, and check GL's metrics (there's a sketch below for pulling them over the API).
Also, streams can use regexes in their rules, so check those as well.
Check every metric you can.
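For the metrics part, a rough sketch of pulling them over the Graylog REST API and filtering for journal and buffer entries; it assumes the API is on localhost:9000/api with basic-auth admin credentials, and that the endpoint returns the full metric registry as JSON:

```python
import requests
from requests.auth import HTTPBasicAuth

# Assumptions: Graylog REST API on localhost:9000/api, admin credentials below.
API = "http://localhost:9000/api"
AUTH = HTTPBasicAuth("admin", "password")

resp = requests.get(f"{API}/system/metrics", auth=AUTH,
                    headers={"Accept": "application/json"})
resp.raise_for_status()
registry = resp.json()

# Grep the registry for journal- and buffer-related entries rather than
# hard-coding exact metric names, since those can differ between versions.
for section, entries in registry.items():
    if not isinstance(entries, dict):
        continue
    for name, value in entries.items():
        if "journal" in name or "buffers" in name:
            print(section, name, value)
```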
What is the number of shards in ES, and how much RAM is available? Elasticsearch keeps information about every shard in memory, and if you have too many shards, ES becomes extremely slow while not consuming many CPU cycles either. This can become apparent later with no changes to Graylog at all, if the number of shards gradually grows in your system.
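If it helps, a quick sketch of the checks I mean, assuming ES is reachable on localhost:9200; the _cluster/health and _cat endpoints show the shard counts and per-node heap usage:

```python
import requests

ES = "http://localhost:9200"   # assumption: ES reachable locally

# Cluster-wide shard totals.
health = requests.get(f"{ES}/_cluster/health").json()
print("active primary shards:", health["active_primary_shards"])
print("active shards total:  ", health["active_shards"])

# Heap usage per node, to compare against the heap size you configured.
print(requests.get(f"{ES}/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max").text)

# Per-index shard layout -- handy for spotting index sets that create too many shards.
print(requests.get(f"{ES}/_cat/shards?v").text)
```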
@jtkarvo 4 shards is all. This isn't some huge setup, really; we average 10 GB of logs per day. We just increased the virtual hardware to deal with this, so it now has 32 GB of RAM and 8 vCPUs. We use pipelines pretty extensively, but we've been running this setup for over a year without a hitch.
And as quickly as it started, it stopped. Today the Graylog server is mostly idle, and the Out messages haven't dropped to 0. Monitoring shows it ran hot for about a day and a half, then just dropped back to normal levels.
As a result of all this, we're now working on breaking Elasticsearch off into its own AWS ES service cluster and upgrading to Graylog 3 while we're at it.