I’m running a 3-node Graylog cluster on Ubuntu 16.04 LTS, with a 3-node Elasticsearch (ES) cluster on the same hosts.
We’re not having disk-space issues, but the node journals have been filling up over the past few days with zero messages being output to ES.
I’ve deleted the /var/log/graylog/server.log file, which gets them working again for almost a day.
We’ve checked CPU and RAM, and both are fine on the GL and ES nodes.
We’ve gone through and checked our extractors and cleaned up any that were slow.
We’ve rebooted all Graylog nodes.
Common advice seems to be to delete the journal and restart the Graylog service, but whenever I do that, the Graylog service will not start or rejoin the cluster. (Fortunately, rather than doing ‘rm’, I simply renamed the journal folder and created a new blank one.)
Swapping the original journal folder back in and restarting allows the node to rejoin the cluster.
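For reference, the procedure I’m using looks roughly like this. It’s a sketch, not the official docs: the journal path is the default from our /etc/graylog/server/server.conf, and the `rotate_journal` helper is just my own wrapper around the rename.

```shell
#!/bin/sh
# Move the journal aside instead of deleting it, so it can be swapped
# back in if the node refuses to start with an empty journal.
rotate_journal() {
  journal="$1"
  # Keep the old journal under a timestamped .bak name
  mv "$journal" "${journal}.$(date +%Y%m%d%H%M%S).bak"
  # Create a fresh, empty journal directory in its place
  mkdir -p "$journal"
}

# On each node (service stopped first, ownership restored after):
#   sudo systemctl stop graylog-server
#   rotate_journal /var/lib/graylog-server/journal
#   sudo chown -R graylog:graylog /var/lib/graylog-server/journal
#   sudo systemctl start graylog-server
```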
There’s a message about Kafka not being able to get access to the ‘/var/lib/graylog-server/journal/.lock’ file: it claims another process is accessing it. ‘lsof’ shows no such process.
It looks like this:
2017-10-23T12:28:15.083-07:00 ERROR [CmdLineTool] Guice error (more detail on log level debug): Error injecting constructor, java.lang.RuntimeException: kafka.common.KafkaException: Failed to acquire lock on file .lock in /var/lib/graylog-server/journal. A Kafka instance in another process or thread is using this directory
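This is how I’m checking for a holder of the lock file (a hypothetical `lock_holders` wrapper around lsof; the journal path is from our default install):

```shell
#!/bin/sh
# Print the PIDs of any processes that have the given file open.
# Empty output means nothing is holding it.
lock_holders() {
  lsof -t "$1" 2>/dev/null
}

# On the affected node:
#   lock_holders /var/lib/graylog-server/journal/.lock
# ...prints nothing, yet Graylog still reports the lock as held.
```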
Is there another step after deleting the journal that I’m missing?