I tried to post about this a while ago and got no answers, so I’ll try again.
We have a Graylog cluster on physical servers (quite good specs) whose journals fill up every 1 to 3 days, which makes the entire cluster slow.
We have tried expanding it from 6 to 9 nodes and also tried the latest Java 8, but it still happens at the same rate as before.
The servers run Debian 10 with SSDs, bonded 10Gb NICs and Java 8u251. The backend is 50 Elasticsearch nodes, also on Debian 10 with bonded 10Gb NICs.
Versions: Graylog 3.2.5, MongoDB 4.0.17 and Elasticsearch 6.8.9.
The workaround right now is a set of Nagios checks that restart Graylog when the journal reaches 500K unprocessed messages. This solves about 95% of the problem (sometimes the check fails when it can’t get the journal value from the API).
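A minimal sketch of a check of this kind, to make the failure mode concrete. It is an illustration under assumptions, not the script in use here: it assumes the journal summary is served as JSON at `/api/system/journal` with an `uncommitted_journal_entries` field, uses HTTP basic auth, and the function names, retry settings and the 250K warning level are all hypothetical. Only the 500K critical threshold comes from the setup described above. The API read is retried a few times and a failed read is reported as UNKNOWN instead of being treated as a healthy journal.

```python
import base64
import json
import time
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def fetch_journal(url, user, password, retries=3, timeout=5.0, delay=2.0):
    """Return the parsed journal summary, or None if every attempt fails.

    `url` is assumed to look like http://<node>:9000/api/system/journal.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(url, headers={
        "Authorization": f"Basic {token}",
        "Accept": "application/json",
    })
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                return json.load(response)
        except (OSError, ValueError):
            # OSError covers connection errors, HTTP errors and timeouts;
            # ValueError covers a body that is not valid JSON.
            if attempt + 1 < retries:
                time.sleep(delay)
    return None


def evaluate(summary, warn=250_000, crit=500_000):
    """Map a journal summary to a (Nagios exit code, message) pair.

    crit=500_000 matches the restart threshold described above;
    warn=250_000 is an arbitrary illustrative value.
    """
    if summary is None:
        return UNKNOWN, "UNKNOWN - journal API did not answer"
    entries = summary.get("uncommitted_journal_entries")
    if not isinstance(entries, int):
        return UNKNOWN, "UNKNOWN - no journal count in API response"
    if entries >= crit:
        return CRITICAL, f"CRITICAL - {entries} unprocessed messages"
    if entries >= warn:
        return WARNING, f"WARNING - {entries} unprocessed messages"
    return OK, f"OK - {entries} unprocessed messages"
```

A plugin wrapper would call `evaluate(fetch_journal(url, user, password))`, print the message and exit with the returned code, so that a restart handler fires only on CRITICAL and a failed API read shows up separately as UNKNOWN.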
Does anyone have any clues about what causes this and how to fix it?