I know this seems like a classic topic, but I am not able to hunt it down alone, so here’s the situation:
- 2-node Graylog cluster (version 2.5.1)
- 11-node ES cluster, version 5.6.14 (3 master nodes plus 8 data hosts running 2 ES instances each, so 3 master and 16 data nodes in total)
- about 2500 msg/s
Our problem seems to be that certain messages, or maybe even just one message, stop message processing on our Graylog nodes. From time to time, with no regularity that I can see, one node stops processing its messages. The process buffer fills up while the output buffer remains empty. After a restart of the service via systemctl, everything works perfectly again.
I am already monitoring [org.graylog2.shared.buffers.processors.ProcessBufferProcessor.processTime] and I see max and mean going up when processing comes to a stop. In htop I don’t see any shortage of hardware resources.
So to me it looks like a message can’t be processed by a regex or Grok pattern. But how do I figure out who is sending these messages? I tried to find out via the trace log, but that log also simply stops after the last successfully processed message.
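One thing I could still try is taking a JVM thread dump (for example with the JDK's jstack tool) while a node is stuck, and checking whether the processing threads are sitting inside java.util.regex. Below is a minimal sketch of a script for that. It assumes that the process buffer threads have "processbufferprocessor" in their names and that the dump is in the usual jstack text layout, with thread entries separated by blank lines; both are assumptions on my side, not something I have confirmed for 2.5.1. The function name is made up for this example.

```python
import re

# Assumption: Graylog's process buffer threads carry this string in their name.
THREAD_NAME_HINT = "processbufferprocessor"

# A jstack thread entry starts with the thread name in double quotes.
HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# A stack frame line looks like: "\tat package.Class.method(File.java:123)".
# Only the qualified method name before the "(" is captured.
FRAME = re.compile(r'^\s+at\s+(?P<frame>[^(\s]+)')

# Frames with these prefixes belong to the Java runtime, not the application.
JDK_PREFIXES = ("java.", "javax.", "sun.", "jdk.")


def find_regex_bound_threads(dump_text, name_hint=THREAD_NAME_HINT):
    """Return (thread_name, first_non_jdk_frame) pairs for matching threads
    whose stack contains at least one java.util.regex frame.

    first_non_jdk_frame is None if the stack has no application frame.
    """
    results = []
    # Thread entries are assumed to be separated by blank lines.
    for block in re.split(r"\n\s*\n", dump_text):
        lines = block.strip("\n").splitlines()
        if not lines:
            continue
        header = HEADER.match(lines[0])
        if not header or name_hint not in header.group("name").lower():
            continue
        frames = [m.group("frame") for m in map(FRAME.match, lines[1:]) if m]
        if not any(f.startswith("java.util.regex.") for f in frames):
            continue
        # The first application frame hints at which component ran the regex.
        caller = next((f for f in frames if not f.startswith(JDK_PREFIXES)), None)
        results.append((header.group("name"), caller))
    return results
```

With a dump saved to a file (say dump.txt, a hypothetical name), calling find_regex_bound_threads(open("dump.txt").read()) would list the stuck threads together with the first non-JDK frame in each stack. That frame should at least show whether an extractor, a Grok pattern, or a pipeline rule started the match, which narrows down which rule to look at. It does not reveal the sender of the message by itself.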
Any help is highly appreciated!