Graylog stops processing messages after log flood

Hi, everyone. I’m experiencing some serious performance troubles. Whenever a device I’m monitoring with Graylog (or Graylog itself) comes back up after a period of downtime, Graylog receives a flood of logs that it is unable to process.
I’ve experienced this issue twice:

  • the first time I had to stop all inputs and let Graylog process the queued logs;

  • the second time (Graylog ran out of disk space and crashed; that problem was solved by expanding the partition) I tried the previous approach again, but it was useless, since no logs were being processed anymore. Following some suggestions from the IRC community I stopped Graylog and deleted the journal files (there were about 1 million unprocessed logs). This procedure didn’t solve the issue, and now Graylog also says “-299,416,322 unprocessed messages are currently in the journal, in 1 segments” (I guess this is due to an integer overflow).

Hardware info
My Graylog installation currently runs on a virtual machine with 2 sockets of 4 cores each and 16 GB of RAM. On both occasions all the cores ran at very low utilization, and changing Graylog’s configuration file so that they could reach 90% utilization or more didn’t help.
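
For reference, these are the kinds of settings I changed in server.conf to try to push CPU utilization higher (the values below are only a sketch, not my exact configuration):

```
# Excerpt from /etc/graylog/server/server.conf (path and values are examples)
# Processor threads for the input, process and output buffers;
# raising these lets Graylog use more of the available cores.
inputbuffer_processors = 2
processbuffer_processors = 5
outputbuffer_processors = 3

# How the processors wait for work; "blocking" keeps idle CPU usage low
processor_wait_strategy = blocking
# Size of the ring buffers; must be a power of two
ring_size = 65536
```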

Hey @Mario

Did you stop Graylog before you deleted the journal?

If you delete all the content of the journal, it should just get recreated on the next startup.

What do your logs show?

Hi, thanks for the answer. I solved the problem just a few minutes ago by deleting all the files in the journal folder (with Graylog stopped); the commands I used are sketched below. I read on another forum that the problem seems to be caused by a corrupted journal file. For the future: do you think there is a way to find the corrupted entries (e.g. via a script) and remove them, in order to lose the smallest possible number of logs?
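
For anyone who finds this later, the reset procedure I used was roughly the following (the journal path is the default for the Debian/Ubuntu package, adjust it to your installation):

```
# Stop Graylog so nothing is writing to the journal
sudo systemctl stop graylog-server

# Delete the on-disk journal; every unprocessed message in it is lost!
sudo rm -rf /var/lib/graylog-server/journal/*

# Start Graylog again; an empty journal is recreated on startup
sudo systemctl start graylog-server
```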

Do you have any suggestions on how to improve Graylog’s performance when a massive flood of logs arrives and it cannot process them quickly enough, considering the hardware specifications in my initial post?

@Mario

Rebuilding the Kafka journal is not trivial. Some tools are available for that, but you need to read all entries up to the corrupted one, then all entries after it, and rebuild the journal from those …
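
If you only want to inspect the journal, the segment files are written in Kafka’s on-disk log format, so Kafka’s DumpLogSegments tool can usually read them. A rough example (the Kafka installation path and the journal path are assumptions, adjust them to your setup):

```
# Print the records of one Graylog journal segment to stdout
/opt/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments \
  --deep-iteration \
  --files /var/lib/graylog-server/journal/messagejournal-0/00000000000000000000.log
```

From that output you can at least see where the corruption starts and how many messages sit before and after it.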

It will be easier to monitor the disk space on the Graylog server and choose a journal size that fits on it. If your environment can produce floods of messages, I would consider running a queue (for example Kafka or RabbitMQ) between your incoming messages and Graylog, so you have a buffer available.
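
The journal size and retention are configured in server.conf; as a rough sketch (example values only, pick limits that actually fit on your disk):

```
# Excerpt from server.conf - example values only
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal

# Cap the journal so a log flood cannot fill the partition
message_journal_max_size = 5gb
# Drop journal entries older than this even if the size limit is not reached
message_journal_max_age = 12h
```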
