Node will not rejoin cluster after deleting journal

I’m running a 3-node Graylog cluster on Ubuntu 16.04 LTS with a 3-node ES cluster on the same.
We’re not having disk space issues, but the node journals have been filling up over the past few days with zero messages being output to ES.

I’ve deleted the /var/log/graylog/server.log file which gets them working for almost a day.
We’ve checked CPU and RAM and they are fine on GL and ES nodes.
We’ve gone through and checked our extractors and cleaned up any that were slow.
We’ve done reboots on all Graylog nodes.

Common advice seems to be to delete the journal and restart the Graylog service but, whenever I do that, the Graylog service will not start or rejoin the cluster. (Fortunately, rather than doing ‘rm’, I simply renamed the journal folder and created a new blank one.)
Swapping the original journal folder back in and restarting allows the node to rejoin the cluster.

There’s a message about Kafka not being able to get access to the ‘/var/lib/graylog-server/journal/.lock’ file, saying that another process is accessing it. There is no such process shown by ‘lsof’.

It looks like this:
2017-10-23T12:28:15.083-07:00 ERROR [CmdLineTool] Guice error (more detail on log level debug): Error injecting constructor, java.lang.RuntimeException: kafka.common.KafkaException: Failed to acquire lock on file .lock in /var/lib/graylog-server/journal. A Kafka instance in another process or thread is using this directory

Is there another step after deleting the journal that I’m missing?

No, but you should only delete the journal files when Graylog is stopped and restart it afterwards.

Also make sure that there’s currently no other Graylog instance running which might still lock the journal directory.
When in doubt, check if there’s any Java process running after you’ve stopped Graylog and stop them with kill or killall.

I have been stopping graylog before deleting the journal and checking that there is no other ‘java’ process running. There never has been.
lsof | grep journal shows nothing accessing the folder or files in it.

But really, I’m not deleting it… I’m just renaming the folder to .bak and creating an empty one with the same permissions. This is such a common thing to do on *nix that I never gave it a thought. Is it a problem?
(NVM, I’ll just try moving it to a different folder entirely this time and find out for myself…)

Thanks very much for the suggestions!

For anyone else experiencing the Kafka .lock file issue when trying to clear the Graylog journal:

You must actually delete the journal folder messagejournal-X or move it to a completely different folder.
Renaming it and leaving it in the /var/lib/graylog-server/journal folder (or whatever your path is) causes the Kafka .lock file issue.
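Here is a rough sketch of moving the journal contents out of the way, hidden files such as .lock included. The helper name move_journal_contents and the backup path in the usage comment are made up for illustration. It assumes Graylog is already stopped and that the backup directory already exists somewhere outside the journal directory.

```shell
# Hypothetical helper, not part of Graylog.
# Moves everything out of the journal directory (dotfiles included) into a
# backup directory that must live OUTSIDE the journal directory.
# Only run this while graylog-server is stopped.
move_journal_contents() {
    local journal_dir="$1"
    local backup_dir="$2"
    local journal_real
    local backup_real

    [ -d "$journal_dir" ] || { echo "no such directory: $journal_dir" >&2; return 1; }
    [ -d "$backup_dir" ] || { echo "no such directory: $backup_dir" >&2; return 1; }

    # Resolve both paths so the comparison below is not fooled by symlinks.
    journal_real=$(cd "$journal_dir" && pwd -P) || return 1
    backup_real=$(cd "$backup_dir" && pwd -P) || return 1

    # Per this thread, leaving a renamed copy inside the journal directory
    # is what triggers the .lock error, so refuse that layout.
    case "$backup_real/" in
        "$journal_real"/*)
            echo "backup must live outside $journal_real" >&2
            return 1
            ;;
    esac

    # -mindepth 1 skips the journal directory itself; find matches hidden files too.
    find "$journal_real" -mindepth 1 -maxdepth 1 -exec mv {} "$backup_real"/ \;
}

# Usage (the backup path is an assumption, pick your own):
#   mkdir -p /var/tmp/journal-backup
#   move_journal_contents /var/lib/graylog-server/journal /var/tmp/journal-backup
```

Moving the contents rather than the directory itself keeps the original owner and permissions on the journal directory, so nothing has to be recreated by hand.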

One interesting result is that now the node displays -371k messages. How can you have negative messages?

The number doesn’t seem to be going down (or up rather…). I tried deleting server.log and restarting but that did not clear it.

Any ideas folks?

Also see http://docs.graylog.org/en/2.3/pages/faq.html#dedicated-partition-for-the-journal :wink:

Stop Graylog, remove the journal directory (or move it to another directory), then recreate a directory with the same permissions and make sure that it’s empty (also “hidden” files starting with a dot).
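A minimal sketch for the "make sure that it’s empty" part, since a plain ls will not show dotfiles like .lock. The function name journal_is_empty is just for illustration; the path in the usage comment is the one from the error message above.

```shell
# Hypothetical helper, not part of Graylog.
# Succeeds only if the directory exists and contains nothing at all.
journal_is_empty() {
    # ls -A also lists hidden files (but not . and ..), so .lock is caught here.
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Usage, before starting Graylog again:
#   journal_is_empty /var/lib/graylog-server/journal || echo "journal still has files in it"
```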

I have a /var partition where the journal lives at the moment. If it fills the disk repeatedly (causing me to drink too much tequila), I’ll move it to its very own partition (being careful to create a subfolder as suggested).

So for the record: if you’re running more than 1 graylog node, it’s important to stop ALL of them at once, delete the journals on all of them, then start them up.

In this case, I just started the master node first and am seeing it process through the -370,000 messages it had stockpiled. Not sure what that number represents now that the journal has been deleted but the node is grinding its way up to 0, so something or other is happening.

Ok I feel kinda stupid now but, hey, still a bit of a noob to Graylog
(Graylog is awesome btw… it’s doing a great job for us! So if any contributors read this, thanks for building such a great project!)

I re-read @jochen’s suggestion more carefully along with this conversation in the Google Group: https://groups.google.com/forum/#!topic/graylog2/puFYLLCEoIw and realized that I was removing the messages directory inside the journal directory which is wrong. You have to remove everything inside the journal directory because there are files in there which track the message count:
$ cat graylog2-committed-read-offset
329340792
So of course, leaving those files in place but deleting the messages would make the count negative.
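To spell out the arithmetic as I understand it: this is my guess at how the count is derived, not something confirmed in this thread. I’m assuming the displayed number is roughly the newest offset written to the journal minus the committed read offset. The function name and the small numbers in the comments are made up for illustration.

```shell
# Assumption: backlog = newest offset written - committed read offset.
# journal_backlog is a hypothetical name, not a Graylog command.
journal_backlog() {
    local log_end_offset="$1"
    local committed_read_offset="$2"
    echo $(( log_end_offset - committed_read_offset ))
}

# Normal case: 1500 written, 1000 already read, 500 still waiting.
#   journal_backlog 1500 1000
# Messages deleted but the offset file left behind: writing restarts near 0
# while the old committed offset (say 1000) is still there, so the result
# is negative and climbs back toward 0 as new messages arrive.
#   journal_backlog 0 1000
```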
