For the past few months we’ve had intermittent issues with Graylog where it stops submitting anything to Elasticsearch; or rather, that’s what it looks like. Dumping the process buffers any time things got “stuck” (over the course of a few weeks) shows that processing apparently stops. The messages in the buffer are partial JSON, most likely caused by Beats not merging lines correctly, and we do some JSON parsing in pipelines. It appears that processing either gets stuck in a loop and eventually exits it, or just flat out gets stuck.
At current peak usage we ingest ~6000 msg/sec, so any time this problem occurs we easily end up with 10 to 20 million(!) messages in the journal, which takes a fair amount of time to catch up on.
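To put those numbers in perspective: at 6000 msg/sec, a backlog of 10 to 20 million messages corresponds to roughly 28 to 56 minutes of stalled processing. Below is a rough back-of-the-envelope sketch in Python. The processing rate used for the catch-up estimate is a made-up assumption, not something measured on this cluster; the function names are purely illustrative.

```python
# Back-of-the-envelope journal backlog arithmetic.
INGEST_RATE = 6000  # msg/sec at peak, taken from the post

# ASSUMPTION: hypothetical sustained processing rate once things are unstuck.
HYPOTHETICAL_PROCESSING_RATE = 9000  # msg/sec, not a measured value


def stall_duration_minutes(backlog: int, ingest_rate: int = INGEST_RATE) -> float:
    """Minutes of stalled processing needed to accumulate `backlog` messages."""
    return backlog / ingest_rate / 60


def catch_up_minutes(backlog: int, processing_rate: int,
                     ingest_rate: int = INGEST_RATE) -> float:
    """Minutes to drain the backlog while new messages keep arriving."""
    headroom = processing_rate - ingest_rate
    if headroom <= 0:
        raise ValueError("processing rate must exceed ingest rate to catch up")
    return backlog / headroom / 60


if __name__ == "__main__":
    for backlog in (10_000_000, 20_000_000):
        stalled = stall_duration_minutes(backlog)
        drain = catch_up_minutes(backlog, HYPOTHETICAL_PROCESSING_RATE)
        print(f"{backlog:>10,} msgs: stalled ~{stalled:.1f} min, "
              f"catch-up ~{drain:.1f} min")
```

The point is only that catch-up time depends on the headroom between processing and ingest rates, so a stall during peak hours hurts much more than the stall duration alone suggests.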
Does anyone have any idea what on earth causes this, or whether there are any tuning options I can set to force pipelines to time out and skip processing, or, well, anything really? We’ve tried a variety of things, but unfortunately we’re not always in control of the Beats configuration on the sender end…
Whenever I see things like this, my default is to suspect a runaway regex/GROK pattern. The better term might be catastrophic backtracking, as discussed in this article: Runaway Regular Expressions: Catastrophic Backtracking
It’s a long-winded read for your question and not a solution, but perhaps it will point you in the right direction…
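For anyone who hasn't run into catastrophic backtracking before, here is a minimal sketch in Python (not Graylog or GROK syntax). The two patterns are textbook examples picked for illustration and are not taken from anyone's pipeline. Both accept the same strings, but the nested quantifier takes exponentially longer on a near-miss input.

```python
import re
import time

# Nested quantifier: exponential backtracking when the match almost succeeds.
NESTED = re.compile(r"^(a+)+$")
# Accepts the same strings, but with a single quantifier the work stays linear.
FLAT = re.compile(r"^a+$")


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for one match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    for n in (10, 15, 20):
        text = "a" * n + "b"  # the trailing "b" forces the match to fail
        _, nested_s = time_match(NESTED, text)
        _, flat_s = time_match(FLAT, text)
        print(f"n={n:2d}  nested={nested_s:.4f}s  flat={flat_s:.6f}s")
```

Each extra character roughly doubles the time for the nested pattern, so a message a few dozen characters long can keep a thread busy for a very long time, which from the outside looks exactly like a stuck processor.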
We’ve had 4.3.3 running for ages; it seems more frequent in that version, but it has happened before then too. An upgrade is in the pipeline, but it’s currently on hold because I’m trying to be on vacation (heh…) and I also want to switch from ES to OpenSearch while I’m at it, and a rolling upgrade on a 20-odd node cluster isn’t a couple-of-hours affair.
In the pipeline itself the only regex that exists is in a rule that checks if the message looks like JSON: the awfully simple /^\{.*\}$/ which, as far as I know, can’t run away, since a single anchored .* only backtracks linearly. Still a good read though, thank you!
Small update: I moved the JSON extraction into an extractor on the input, which seems to have solved the issue. We hadn’t done this before because we receive mixed JSON/non-JSON logs, and for it to work (without using any regexes) we basically need to always attempt extraction, which seemed like wasted CPU cycles for the non-JSON stuff.
But I’d rather have some wasted CPU cycles at the moment than have the pipelines get jammed.
There is the ‘is_json’ function for processing pipelines, and you can use it in the when clause.
That gives better results than the false matches of the regex.
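To illustrate what "false match" means here, a small Python sketch (not pipeline rule syntax, and it does not reproduce how Graylog's is_json behaves internally). It compares the regex from earlier in the thread with actually attempting a parse. The sample messages and function names are made up for the example.

```python
import json
import re

# The "looks like JSON" regex quoted earlier in the thread.
LOOKS_LIKE_JSON = re.compile(r"^\{.*\}$")


def looks_like_json(message: str) -> bool:
    """Shape check only: starts with '{' and ends with '}'."""
    return LOOKS_LIKE_JSON.match(message) is not None


def is_json_object(message: str) -> bool:
    """Real check: the message must parse, and parse to a JSON object."""
    try:
        return isinstance(json.loads(message), dict)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


if __name__ == "__main__":
    samples = [
        '{"level": "info", "msg": "ok"}',  # valid JSON object
        '{"a": 1, "b": {"c": 2}',          # truncated: outer brace is missing
        '{not json at all}',               # braces, but not JSON
        'plain text line',                 # ordinary log line
    ]
    for sample in samples:
        print(f"regex={looks_like_json(sample)!s:5}  "
              f"parsed={is_json_object(sample)!s:5}  {sample}")
```

The second and third samples pass the regex but fail the parse, which is the situation a partially merged multi-line message from Beats can produce: the rule's when clause lets it through and the JSON handling downstream has to deal with garbage.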