Hi, everyone. I’m experiencing some serious performance troubles. Whenever a device I’m monitoring with Graylog (or Graylog itself) comes back up after a period of downtime, Graylog receives a flood of logs that it is unable to process.
I’ve experienced this issue twice:
the first time I had to stop all inputs and let Graylog process the queued logs;
the second time (Graylog ran out of disk space and crashed; that was solved by expanding the partition) I tried the same approach again, but it was useless, since no logs were being processed anymore. Following some suggestions from the IRC community, I stopped Graylog and deleted the journal files (there were about 1 million unprocessed logs). This procedure didn’t solve the issue, and now Graylog also says “-299,416,322 unprocessed messages are currently in the journal, in 1 segments” (I guess this is due to an integer overflow).
My Graylog installation is currently running on a virtual machine with 2 sockets with 4 cores each and 16 GB of RAM. On both occasions all the cores stayed at very low utilization, and changing Graylog’s configuration file so that they could reach 90% utilization and above didn’t help.
Did you stop Graylog before you deleted the journal?
If you had deleted all the contents of the journal, it should just get recreated on the next startup.
What do your logs show?
Hi, thanks for the answer. I solved the problem just a few minutes ago by deleting all the files in the journal folder (with Graylog stopped). I read on another forum that the problem seems to be caused by a corrupted journal file. For the future: do you think there is a way to find the corrupted entries (e.g. via a script) and remove them, so that as few logs as possible are lost?
Do you have any suggestions on how to improve Graylog’s performance whenever a massive log flood is received and it is not able to process them quickly enough, considering the hardware specifications in my initial post?
Rebuilding the Kafka journal is not trivial. Some tools are available for that, but you need to read all entries up to the corrupt one, then the entries after it, and rebuild the journal …
It will be easier to monitor the disk space on the Graylog server and choose a journal size that fits on it. If your environment can produce floods of messages, I would consider running a queue between your incoming messages and Graylog to have a buffer available.
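To illustrate the disk-space advice above, here is a minimal Python sketch that checks whether a chosen journal size fits on the partition with some spare room. It is only a sketch under assumptions: the helper names (`parse_size`, `journal_fits`, `check_journal`), the 20% headroom, and the example path are made up for illustration, and size suffixes are assumed to be binary multiples (1 kb = 1024 bytes). The size string is meant to be whatever you set for `message_journal_max_size` in Graylog's server.conf.

```python
import shutil

# Assumption: suffixes are binary multiples (1 kb = 1024 bytes).
_UNITS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_size(text):
    """Convert a size string such as '5gb' into a number of bytes."""
    value = text.strip().lower()
    for suffix, factor in _UNITS.items():
        if value.endswith(suffix):
            return int(float(value[: -len(suffix)]) * factor)
    return int(value)  # no suffix: treat as a plain byte count


def journal_fits(free_bytes, journal_max_bytes, headroom=0.2):
    """Return True if a full journal still leaves `headroom` of the free space unused."""
    return journal_max_bytes <= free_bytes * (1 - headroom)


def check_journal(journal_dir, journal_max_size, headroom=0.2):
    """Compare the configured journal size with the space currently free on its partition.

    This is conservative: space already occupied by the journal is not credited back.
    """
    free = shutil.disk_usage(journal_dir).free
    return journal_fits(free, parse_size(journal_max_size), headroom)


if __name__ == "__main__":
    # Hypothetical path; use the directory set as message_journal_dir on your server.
    print(check_journal("/var/lib/graylog-server/journal", "5gb"))
```

Run from cron or a monitoring agent, a check like this could raise an alert before the partition fills up, which is what crashed Graylog the second time.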