After some hilarity with process buffers getting stuck and no longer outputting, I’m now looking at 3 Graylog nodes with 300 million messages in the journal each, with no output to ES. All process buffers are empty/idle (according to the process buffer dump).
A restart of the Graylog server doesn’t clear the issue; however, this shows up in the logs:
2022-08-29T07:10:56.077Z ERROR [ServiceManager] Service JournalReader [FAILED] has failed in the RUNNING state.
java.lang.NullPointerException: null
at org.graylog2.shared.utilities.ByteBufferUtils.readBytes(ByteBufferUtils.java:28) ~[graylog.jar:?]
at org.graylog2.shared.journal.KafkaJournal.read(KafkaJournal.java:609) ~[graylog.jar:?]
at org.graylog2.shared.journal.KafkaJournal.read(KafkaJournal.java:567) ~[graylog.jar:?]
at org.graylog2.shared.journal.JournalReader.run(JournalReader.java:139) ~[graylog.jar:?]
at com.google.common.util.concurrent.AbstractExecutionThreadService$1$2.run(AbstractExecutionThreadService.java:66) [graylog.jar:?]
at com.google.common.util.concurrent.Callables$4.run(Callables.java:119) [graylog.jar:?]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_242]
The only way to fix this, it seems, is to remove the journal and start fresh.
From what I have been reading, the NullPointerException leads me to believe the journal is corrupt, but I can’t find anything that says so definitively. It’s a stretch, but you could stop the server, move the journal data out, start Graylog, verify it’s running smoothly, then turn it off and swap the old data back in again. It sounds like you may have already done that. @dscryber could see if there are any Graylog engineers around who might have worked with this before…
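In case it helps, here is a rough sketch of the "move the journal data out" step in Python. It assumes Graylog is already stopped, and the function name and both paths are hypothetical placeholders of mine, not anything from the Graylog docs. It only renames the directory into a timestamped backup so the old data can be swapped back later; whether Graylog recreates the journal directory on start, or needs it created with the right ownership, is something to check on your install.

```python
import shutil
import time
from pathlib import Path


def set_journal_aside(journal_dir, backup_root):
    """Move the journal directory into a timestamped backup location.

    Assumes the Graylog service is stopped. Returns the backup path so
    the old journal can be moved back later.
    """
    journal = Path(journal_dir)
    backup_root = Path(backup_root)
    if not journal.is_dir():
        raise FileNotFoundError(f"journal directory not found: {journal}")

    backup_root.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    target = backup_root / f"{journal.name}-{stamp}"
    if target.exists():
        # Refuse to merge into an existing backup by accident.
        raise FileExistsError(f"backup target already exists: {target}")

    shutil.move(str(journal), str(target))
    return target


# Hypothetical usage; both paths are assumptions, adjust to your setup:
# set_journal_aside("/var/lib/graylog-server/journal", "/var/tmp/journal-backup")
```

Swapping the old data back in would be the same move in reverse, again with the server stopped.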
I have indeed moved the journal out of the way and restarted the Graylog instances. They started okay, then got bogged down again with the process buffers staying full (an older, separate issue I’ve had), but it has sort of fixed itself. It gets stuck occasionally; it seems some apps send one message that the pipeline processing just doesn’t like.
This reminded me of a similar issue I had a while back where Graylog would lock up and I could see things listed in a process buffer dump (“I see dead things”). The post was here - hopefully that will have some info that will help (not with the corrupt journal, though…). In short, I believe it was a GROK pattern where I had %{IP:iclient_ip} but some messages had 'localhost', so GROK choked… changing it to the more inclusive %{IPORHOST:client_ip} made all the difference.
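To illustrate the IP vs. IPORHOST difference, here is a small Python sketch. The regexes are deliberately simplified stand-ins that I wrote for this example; they are not the real GROK definitions (which are stricter and also cover IPv6). It only shows why 'localhost' fails an IP-only pattern but passes a host-or-IP one; it does not reproduce the lock-up itself.

```python
import re

# Simplified stand-ins for the GROK patterns (assumptions, not the real definitions).
IP = r"(?:\d{1,3}\.){3}\d{1,3}"
HOSTNAME = r"[A-Za-z0-9][A-Za-z0-9-]*(?:\.[A-Za-z0-9][A-Za-z0-9-]*)*"
IPORHOST = rf"(?:{IP}|{HOSTNAME})"


def extract_client(pattern, value):
    """Return the captured client value, or None if the pattern does not match."""
    match = re.fullmatch(rf"(?P<client_ip>{pattern})", value)
    return match.group("client_ip") if match else None


for value in ("192.168.1.10", "localhost"):
    print(value, "| IP:", extract_client(IP, value),
          "| IPORHOST:", extract_client(IPORHOST, value))
```

With these stand-ins, 'localhost' comes back as None under the IP pattern and as 'localhost' under IPORHOST, which is the mismatch described above.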
We don’t really use grok patterns anymore. I’ve managed to get our developers to log everything in JSON, so all we do is extract it (which, well, I’ve got issues with some of that too, but that’s another story altogether) and run pipelines to massage it a bit more.
Service JournalReader [FAILED] has failed in the RUNNING state.
java.lang.NullPointerException: null
Not sure if this will help, but I have seen this resolved by upgrading Java and/or increasing the JVM heap.
But most of the time it’s some type of GROK/regex that is not correct.