I have a single combined Graylog 3.0.0-12 / Elasticsearch 6.6.0 instance (4 vCPU / 8 GB RAM, virtualized on Proxmox) where, at 10am EST / 1500 UTC every day, Graylog stops processing messages with no error message.
Both Graylog and Elasticsearch have been given 2 GB of heap, and monitoring shows no memory or heap exhaustion.
At 10am (EST) a single Graylog thread starts spinning at 100% CPU and the process buffer starts filling up; some messages are still processed, but once the buffer fills up no new messages are written to Elasticsearch. Once the process buffer is full, a second Graylog thread starts spinning at 100% CPU.
There are no log entries (even at debug level) in Graylog's server.log, and nothing of note in the Elasticsearch logs either.
The data is a few servers' audit logs in one ingest, with some Grok matching rules.
Restarting Graylog allows messages to be processed again (with no apparent dropped messages).
Any insight into further troubleshooting steps or resolving this issue?
Check your Graylog log (when the daily tasks run).
Also check your GL server's clock and time settings.
And the buffers' state.
Check the number of incoming messages (a quick way to pull the buffer and throughput numbers out of the REST API is sketched below).
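A minimal sketch for that, assuming the Graylog REST API is on localhost:9000 and admin:yourpassword works (adjust both for your setup); since exact metric names vary between versions, it just filters the full gauge list:

```python
import base64
import json
import urllib.request

# Assumptions: Graylog REST API on localhost:9000, admin:yourpassword credentials.
GRAYLOG_API = "http://localhost:9000/api"
AUTH = base64.b64encode(b"admin:yourpassword").decode()

req = urllib.request.Request(GRAYLOG_API + "/system/metrics")
req.add_header("Authorization", "Basic " + AUTH)
req.add_header("Accept", "application/json")

with urllib.request.urlopen(req) as resp:
    metrics = json.load(resp)

# Print every buffer- and throughput-related gauge; exact metric names differ
# between Graylog versions, so filtering by substring is safer than hard-coding.
for name, data in sorted(metrics.get("gauges", {}).items()):
    if "buffers" in name or "throughput" in name:
        print(name, "=", data.get("value"))
```

Run it a few times while the problem is happening and compare with a quiet period.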
GL does some tasks at night, e.g. rotating indices, merging the old ones, etc., and during that time it stops processing for a little while. On my systems GL does this at 00:00 UTC, but that doesn't match 10am EST.
Based on your config, or some other problem, it could take more time.
Or the merge of the old index slows down your ES, and it can only accept some of the messages from Graylog.
Or one of your clients sends a lot of messages, more than GL can handle.
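If you want to check the merge/maintenance theory, here is a minimal sketch, assuming Elasticsearch is on localhost:9200 with no auth; it reads cluster health and the cluster-wide merge stats, which you could compare before and during the 10am window:

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on localhost:9200 with no auth; adjust if needed.
ES = "http://localhost:9200"

def get_json(path):
    with urllib.request.urlopen(ES + path) as resp:
        return json.load(resp)

# Cluster health around 10am EST: a yellow/red status or a pile of pending
# tasks at that time would point at ES-side maintenance rather than Graylog.
health = get_json("/_cluster/health")
print("status:", health["status"],
      "| pending tasks:", health["number_of_pending_tasks"])

# Merge activity across all indices; compare a reading taken before the stall
# with one taken while the process buffer is filling up.
merges = get_json("/_stats/merge")["_all"]["total"]["merges"]
print("current merges:", merges["current"],
      "| total merge time (ms):", merges["total_time_in_millis"])
```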
So it happened again today. There is nothing in /var/log/graylog-server/server.log; the last log line is from yesterday.
I do notice something: the traffic IS being processed, and my in and out rates seem to be equal, or at least close enough that I don't notice a difference. Output stops when the process buffer fills (likely because nothing new comes in).
Here is a graph showing messages in/out (they overlap) and the rise in the process buffer usage:
Turn it around: what does your Elasticsearch do when Graylog can't process the messages?
Check whether it has threads available or prints anything in its log. It is more likely that Graylog can't ingest into Elasticsearch than that Graylog itself causes this behaviour.
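A minimal sketch for that check, again assuming Elasticsearch on localhost:9200 with no auth; it dumps the node thread-pool stats and flags any rejections, which would mean ES is pushing back on Graylog's bulk requests:

```python
import json
import urllib.request

# Assumption: Elasticsearch listens on localhost:9200 with no auth; adjust if needed.
ES = "http://localhost:9200"

with urllib.request.urlopen(ES + "/_nodes/stats/thread_pool") as resp:
    stats = json.load(resp)

# Rejections on the write/bulk pool mean ES is refusing Graylog's bulk indexing
# requests, which shows up in Graylog as filled output/process buffers.
for node_id, node in stats["nodes"].items():
    for pool, data in node["thread_pool"].items():
        if pool in ("write", "bulk", "index") or data.get("rejected", 0) > 0:
            print(node.get("name"), pool,
                  "active:", data.get("active"),
                  "queue:", data.get("queue"),
                  "rejected:", data.get("rejected"))
```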
What I can't understand…
On the graph, in under 6 minutes your process buffer goes from 0 to 18-19k.
But (if in and out really are equal; on the graph I only see the out's colour) in those 6 minutes you only got about 6 × 250 messages, so roughly 1.5k. Filling 18k at 200 msg/min would take an hour and a half.
I think if ES can't ingest enough messages, the in should be higher than the out, and the out would probably flatten into a more or less constant line (but of course an ES check can't hurt).
First I suggest you "certify" your graph and double-check that it really contains valid numbers. (Compare it with GL's WUI at many different times.)
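If it helps with that comparison, the metrics sketch above can be turned into a once-a-minute sampler that writes a CSV to lay next to the external graph; same assumptions as before (REST API on localhost:9000, admin:yourpassword, and substring-matched metric names, since the exact names can differ by version):

```python
import base64
import csv
import json
import sys
import time
import urllib.request

# Assumptions: Graylog REST API on localhost:9000, admin:yourpassword credentials.
GRAYLOG_API = "http://localhost:9000/api"
AUTH = base64.b64encode(b"admin:yourpassword").decode()

def all_gauges():
    req = urllib.request.Request(GRAYLOG_API + "/system/metrics")
    req.add_header("Authorization", "Basic " + AUTH)
    req.add_header("Accept", "application/json")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("gauges", {})

# Sample once a minute and print CSV rows (time, metric, value) for every
# buffer- or throughput-related gauge, to compare against the graph and WUI.
writer = csv.writer(sys.stdout)
while True:
    gauges = all_gauges()
    for name, data in sorted(gauges.items()):
        if "buffers" in name or "throughput" in name:
            writer.writerow([time.strftime("%H:%M:%S"), name, data.get("value")])
    sys.stdout.flush()
    time.sleep(60)
```

Leave it running over the 10am window and stop it with Ctrl-C afterwards.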