Hi,
I am currently experiencing an issue related to a high number of unprocessed messages in my Graylog cluster.
Within a very short period of time (approximately 1 minute), the number of unprocessed messages can spike to hundreds of thousands, even though all nodes are in a RUNNING state and message processing is enabled.
Environment:
- Graylog cluster (multi-node setup)
- All nodes status: Running, Load Balancer: Alive
- JVM heap usage appears normal (approximately 1–4 GB per node)
- Journal messages continue to accumulate (see the check script after this list)
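For reference, this is roughly how I have been watching the journal on each node. The host names and credentials are placeholders for my setup, and the response field names are what my nodes appear to return, so treat this as a sketch rather than anything exact:

```python
# journal_check.py - quick look at the disk journal state on each Graylog node.
# Hosts, credentials, and response field names are placeholders/assumptions from
# my own setup; adjust for your cluster. Requires the `requests` package.
import requests

NODES = ["http://graylog-node-1:9000", "http://graylog-node-2:9000"]  # placeholder hosts
AUTH = ("admin", "changeme")  # placeholder credentials

for node in NODES:
    # GET /api/system/journal reports the journal state of that single node
    resp = requests.get(f"{node}/api/system/journal",
                        auth=AUTH,
                        headers={"Accept": "application/json"},
                        timeout=10)
    resp.raise_for_status()
    info = resp.json()
    # 'uncommitted_journal_entries' = messages written to the journal but not yet
    # processed on that node (field names as returned by my nodes)
    print(node,
          "uncommitted:", info.get("uncommitted_journal_entries"),
          "append/s:", info.get("append_events_per_second"),
          "read/s:", info.get("read_events_per_second"))
```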
Issue:
- Sudden spike in unprocessed messages
- Processing throughput cannot keep up with the incoming log volume (see the rate check below)
- Potential delay in log visibility and alerting
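To quantify the gap between ingest and indexing, I have been polling what I believe are the metrics behind the in/out counters in the web UI. The metric names, hosts, and credentials below are assumptions from my setup, so please correct me if they are wrong for newer versions:

```python
# throughput_check.py - compare per-node input vs. output message rates.
# Metric names, hosts, and credentials are assumptions/placeholders from my setup.
# Requires the `requests` package.
import requests

NODES = ["http://graylog-node-1:9000", "http://graylog-node-2:9000"]  # placeholder hosts
AUTH = ("admin", "changeme")  # placeholder credentials
METRICS = [
    "org.graylog2.throughput.input.1-sec-rate",   # messages received per second
    "org.graylog2.throughput.output.1-sec-rate",  # messages written out per second
]

for node in NODES:
    rates = {}
    for name in METRICS:
        resp = requests.get(f"{node}/api/system/metrics/{name}", auth=AUTH, timeout=10)
        resp.raise_for_status()
        # On my nodes the gauge value is returned under "value"
        rates[name] = resp.json().get("value")
    print(node, rates)
```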
Questions:
- What are the most common root causes for this behavior?
- What is the recommended approach to identify the bottleneck (e.g., CPU, disk I/O, journal, or input rate)?
- What tuning steps are recommended to reduce the number of unprocessed messages? (My current processing-related settings are included below for reference.)
- Would it be more appropriate to scale horizontally (add more nodes) or optimize the existing configuration first?
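For context, these are the processing-related settings in my server.conf that I have not touched yet. The values shown are what I believe are the stock defaults, not necessarily my exact values:

```
# Excerpt from /etc/graylog/server/server.conf (values believed to be defaults)
processbuffer_processors = 5      # threads running extractors/pipeline rules
outputbuffer_processors = 3       # threads writing batches to the search backend
inputbuffer_processors = 2        # threads parsing raw input messages
output_batch_size = 500           # messages per indexing bulk request
ring_size = 65536                 # size of the in-memory processing ring buffer
message_journal_enabled = true
message_journal_max_size = 5gb    # disk journal size limit per node
```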
