I have three Graylog (3.1.2) nodes running on Kubernetes. Under unexpected load, the disk journals filled up (and some messages were deleted); utilization then dropped to ~70%, with ~5,000,000 messages in each node's journal.
I increased capacity by replacing the nodes, and processing looked good (sped up) for a short time. However, a little later one of the journals reported -454,097,877 unprocessed messages, while the other two had only a few hundred messages in their journals.
What does this mean?
I replaced the nodes one by one again, but the numbers stayed the same.
The negative number is slowly going up.
It seems the messages in the journals are lost.
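To confirm what each node actually reports, I have been checking the journal state through the Graylog REST API. Here is a minimal sketch of that check; it assumes the `/api/system/journal` endpoint on each node and admin credentials, and the node URLs, password, and exact response field names are placeholders from memory, so they may differ on your setup:

```python
import requests

# Hypothetical node addresses and credentials; replace with your own.
NODES = ["http://graylog-0:9000", "http://graylog-1:9000", "http://graylog-2:9000"]
AUTH = ("admin", "admin-password")

for node in NODES:
    # /api/system/journal reports the local node's disk journal state
    # (uncommitted entries, size, size limit, ...).
    resp = requests.get(
        f"{node}/api/system/journal",
        auth=AUTH,
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    journal = resp.json()
    print(node)
    print("  uncommitted entries :", journal.get("uncommitted_journal_entries"))
    print("  journal size (bytes):", journal.get("journal_size"))
    print("  journal size limit  :", journal.get("journal_size_limit"))
```

On the bad node the uncommitted-entries value matches the negative number shown in the UI.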
I found a warning on the Graylog node that may be related to this issue:
2020-08-07 20:47:54,158 WARN [AbstractTcpTransport] - receiveBufferSize (SO_RCVBUF) for input GELFTCPInput{title=Gelf_TCP_Graylog, type=org.graylog2.inputs.gelf.tcp.GELFTCPInput, nodeId=null} (channel [id: 0xa27b08c9, L:/0.0.0.0:12201]) should be 1048576 but is 425984. - {}
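The reported 425984 is exactly twice the common Linux default for `net.core.rmem_max` (212992), which suggests the kernel capped the requested receive buffer rather than anything Graylog-specific. Below is a minimal sketch to check that on the node; it assumes the container sees the host's value (as far as I know, `net.core.*` sysctls are not namespaced, so they have to be raised on the Kubernetes node itself):

```python
# Compare the kernel's receive-buffer cap with the buffer size Graylog
# requested for the GELF TCP input (value taken from the warning above).
REQUESTED_RCVBUF = 1048576

with open("/proc/sys/net/core/rmem_max") as f:
    rmem_max = int(f.read().strip())

print(f"net.core.rmem_max = {rmem_max}")
if rmem_max < REQUESTED_RCVBUF:
    # The kernel silently caps SO_RCVBUF at its limit, which is what the
    # warning reports. Raising net.core.rmem_max on the node (e.g. via
    # sysctl) should make the warning go away.
    print(f"rmem_max is below the requested {REQUESTED_RCVBUF}; the warning is expected.")
else:
    print("rmem_max is large enough; the warning has another cause.")
```

I am not sure whether this warning is actually related to the negative journal counter, but it is the only suspicious log entry I found.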
It seems the bad node is only taking in messages without processing them, and I cannot figure out how to fix it properly. In the non-prod environment, turning on graylog.journal.deleteBeforeStart and restarting the node brings it back to normal.