Uneven distribution of unprocessed messages


Sorry for the delay; I was at a wedding.

As for your question:

If an influx of messages starts to overwhelm Graylog, the journal will fill up quickly, and it takes some time to recover. As a side note, a 12 GB journal took about 1.5 hours to clear in my environment.

Elasticsearch should be able to distribute these logs.
Best practice is that your buffer processor counts should add up to the number of CPUs in your Graylog configuration.
The example below is from my lab Graylog 4.0 server. This server ingests 30 GB a day and has 10 GB of memory and 12 CPUs.

output_fault_penalty_seconds = 30
processbuffer_processors = 7
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 12gb

So you have 8 CPUs on each of your Elasticsearch nodes, but your configuration shows:

outputbuffer_processors = 5
processbuffer_processors = 10

That alone would be 15 CPUs, and I don't know what you have configured for the input buffer. This means Graylog is creating more processor threads than you have CPU cores.
I'm not 100% sure, but this might be one of your issues.
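As a rough rule of thumb (my own, not official Graylog guidance), I fix the input buffer at 2, give the output buffer about a quarter of the cores, and hand the rest to the process buffer. A quick sketch, which reproduces the 7/3/2 split from my lab config above:

```shell
# Split cores across the Graylog buffers (a rule of thumb, not official guidance).
CORES=12                                  # replace with $(nproc) on the node
INPUT=2                                   # input buffer rarely needs more
OUTPUT=$(( CORES / 4 ))                   # writing batches out to Elasticsearch
PROCESS=$(( CORES - OUTPUT - INPUT ))     # extractors/pipelines do the heavy lifting

echo "inputbuffer_processors = $INPUT"
echo "outputbuffer_processors = $OUTPUT"
echo "processbuffer_processors = $PROCESS"
echo "total = $(( INPUT + OUTPUT + PROCESS )) of $CORES cores"
```

The point is that the three values should never sum to more than the core count, which is exactly what the 5 + 10 (+ input) above violates on an 8-CPU node.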

Also, I would check that each Graylog node's configuration file matches the others.
For the one node whose journal keeps filling up, make sure there isn't anything in the way, like a firewall.

Unfortunately, I had this problem once and corrected it by adjusting my configuration to match my CPUs and adding more resources. I was able to do this quickly because all my servers are virtual machines. Then I restarted the Graylog services.

I did notice there isn't an inputbuffer_processors setting shown in your configuration.

I would make sure that processbuffer_processors, outputbuffer_processors, and inputbuffer_processors together match the number of CPU cores. Judging by the volume of messages you're trying to ingest, you may need to increase the number of CPUs on each node and then adjust your configuration file accordingly.
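For example, on an 8-CPU node a split that adds up to the core count might look like this (a sketch; the exact ratio depends on your load):

```
processbuffer_processors = 4
outputbuffer_processors = 2
inputbuffer_processors = 2
```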

I believe this is an Elasticsearch problem, and that the configuration and resources are creating the issue.
By chance, are all of your shards evenly distributed?


I did find some old posts on the same issue you have. They might not match your situation exactly, but the problems seem similar.

Maybe they will give you some ideas on how to resolve your issue.

And here is something on Elasticsearch.
