Message Journal not being processed

Hi Everyone,

Like the title says, my message journal does fill up during peak usage, which is normal. But when there is minimal usage, I see almost no messages being processed by the configured pipelines, and the journal message count keeps increasing even against the low and slow stream of incoming messages.

I don’t see any errors or anything unusual in either the Graylog or Elasticsearch logs. The process, input, and output buffers are all at 0% usage, but the disk journal continues to grow and is now at 116,636,921+ unprocessed messages.

My message journal settings are below; do these look sane?

message_journal_max_age = 48h
message_journal_max_size = 40gb

message_journal_flush_age = 1m
#message_journal_flush_interval = 1000000
message_journal_segment_age = 1m
message_journal_segment_size = 100mb

Have you made sure that message processing hasn’t been paused by mistake?

You should also check your Elasticsearch, and the connection between Graylog and ES.

Message processing was not paused; however, I found it paused this morning once the journal hit 100% utilization.
I’ve decided to purge the message journal so I don’t end up losing more logs. The ES cluster is healthy/green with no unassigned shards or errors in the logs. I did recently double the core count on the node, but I did not increase the shard count for the high-volume index sets (which I’ve since done, and rotated the active write index).

For volume context, I normally see 500-800 messages/sec during off-hours. A normal business day means 2k-5k messages/sec, with unpredictable bursts of up to 50k messages/sec that can last an hour!
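Some back-of-envelope math shows why the hour-long burst is the dangerous case. The sustained processing rate and average message size below are illustrative assumptions, not measured values from this setup:

```python
# Rough journal backlog during a 1-hour 50k msg/s burst (rates from the post;
# the 10k msg/s sustained processing rate and ~500 bytes/message are assumptions).
burst_rate = 50_000       # incoming messages/sec during the burst
processing_rate = 10_000  # messages/sec the node can actually index (assumed)
burst_seconds = 3_600     # burst lasts one hour

backlog_msgs = (burst_rate - processing_rate) * burst_seconds
backlog_gb = backlog_msgs * 500 / 1024**3  # assumed ~500 bytes/message on disk

print(f"{backlog_msgs:,} messages queued, ~{backlog_gb:.1f} GB of journal")
```

Under those assumptions a single burst queues roughly 67 GB, well past a 40 GB message_journal_max_size, so the journal would start discarding messages before processing catches up.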

I currently have a single-node setup with the ES server and Graylog sharing resources (this is temporary and might change after proving the concept a bit more). The Graylog server has 4 GB of memory while ES has 8 GB.

Any tips are appreciated.

Check the in/out message numbers at the top right of your web UI.
Maybe your Graylog/Elasticsearch can’t handle this volume of messages.
How many CPU cores do you have in the GL and ES servers?
Check your processor settings in GL’s server.conf.
My GL servers can handle 20k EPS per node with 8 cores and 16 GB of memory. On the ES side I have enough resources; it runs with default settings.

I can suggest some tests…
Pause the processing, disable input messages, and resume processing. Check your maximum output rate without any incoming messages. Then repeat it with incoming messages, or send some extra messages during off-hours. Try increasing your message rate to 1k EPS, 5k, 10k… and check where you hit a limit and the journal starts growing.
Also check System → Nodes → your GL node. You will find three buffers; which ones fill up? If it’s the output buffer, your ES can’t handle the traffic, and that will cause a full process buffer as well.
If it’s just the process buffer, your GL can’t handle it.
If you have a GL performance problem, you can check your stream rules’ and pipeline rules’ performance…

Thank you for the response.

I have 8 cores and 16 GB of memory. I timed and tuned some of the regexes in the extractors and rules, and started dropping some of the top talkers that don’t need to be logged. I was still seeing the journal being used heavily, so I doubled the size of the internal ring buffers, and to my surprise it has been very stable all day today. Even with 30-40k events per second (averaging around 300k/min, peaking at 350k/min), the journal isn’t seeing more than 1% usage with an 80 GB journal.
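For anyone else trying this: the ring buffer sizes live in server.conf. A sketch, assuming the stock default of 65536 (the values must be powers of two):

```
# server.conf -- doubling the default ring buffer size
ring_size = 131072
# the input buffer can be sized separately if needed:
#inputbuffer_ring_size = 131072
```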

It might be due to the type of traffic I’m seeing today; I will continue monitoring it and see if it starts getting clogged again.

Thank you very much.

If your journal is filling up, it means your processing can’t keep up. If your processing can’t keep up, it’s because you either don’t have enough CPUs or you have poorly performing message processing due to extractors or pipeline rules, or both. Increasing the journal size will only buffer more messages, which in some cases may be enough because it will get you through spikes in volume, but for extended spikes, you will eventually start flushing the journal and thus lose data anyway.

If your process buffer is full but your output buffer is empty and the journal is filling up, it’s not Elasticsearch that’s the problem; you are hitting a limit with your compute. You can add more CPUs, add more nodes, or tune the processors to better handle the load. With 8 CPUs, I would set the process-buffer-to-output-buffer processor ratio to 6 and 2, or 7 and 1; the output side doesn’t need a lot unless you are seeing the output buffer filling up.
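In server.conf, that 6-and-2 split would look like the fragment below. The setting names are Graylog’s standard processor options; the commented input buffer value is just a common choice, not part of the advice above:

```
# server.conf -- processor split for an 8-core node
processbuffer_processors = 6
outputbuffer_processors = 2
#inputbuffer_processors = 2
```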

Also make sure your extractors are running in the single-digit microsecond range; low double digits are also acceptable in most cases. If some of your extractors are running in the hundreds or thousands of microseconds, they are the cause of your slowness and need to be deleted or re-created.
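A quick sketch of why microseconds matter at this scale; the thread count and target rate here are illustrative assumptions, not measurements:

```python
# Per-message time budget shared by ALL extractors and pipeline rules.
# Assumed: 6 process-buffer threads trying to sustain 20,000 msg/s.
threads = 6
target_rate = 20_000                           # messages/sec
budget_us = threads / target_rate * 1_000_000  # microseconds per message

# ~300 us total; a single extractor averaging 1,000 us already blows the budget.
print(round(budget_us))
```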

Another thing you can do to help is increase your Java heap size. By default it’s 1 GB. Bump it up gradually, but increasing it from 1 GB will help.
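On package installs the heap is usually set via GRAYLOG_SERVER_JAVA_OPTS. The path below matches the deb packages, and 2 GB is just an example first step, not a verified recommendation for this setup:

```
# /etc/default/graylog-server (or /etc/sysconfig/graylog-server on RPM systems)
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g"
```

Setting -Xms equal to -Xmx avoids heap-resize pauses while the JVM grows the heap under load.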