I have two thoughts on this, though I doubt either will be the solution:
You might be a victim of a regex DoS (see Regexploit: DoS-able Regular Expressions · Doyensec's Blog), although I doubt that would happen with the two greedy * in your regex. A pattern like that basically occupies one of your processing threads and slows everything down until there are none left. In our case, that is when the process buffer filled up.
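For contrast, this is what a genuinely DoS-able pattern looks like on the JVM, which is what Graylog's processors run on. A minimal sketch; the pattern and input are made up for illustration and are not taken from your rule:

```java
import java.util.regex.Pattern;

public class RedosDemo {
    public static void main(String[] args) {
        // Classic catastrophic-backtracking pattern (nested quantifiers).
        Pattern evil = Pattern.compile("(a+)+$");
        // The trailing "!" guarantees a mismatch, so the engine backtracks
        // through an exponential number of ways to split up the "a"s.
        String input = "a".repeat(26) + "!";
        long start = System.nanoTime();
        boolean matched = evil.matcher(input).matches();
        System.out.printf("matched=%b after %d ms%n",
                matched, (System.nanoTime() - start) / 1_000_000);
    }
}
```

Each additional "a" roughly doubles the runtime, which is exactly the kind of thing that pins a process buffer thread indefinitely.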
Do you use archiving? We had the problem that archiving occupied all available connections from Graylog to Elasticsearch. Then no more messages could be pushed out to Elasticsearch and the output buffer filled up.
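One way to sanity-check whether Elasticsearch itself is saturated is to look at the write thread pool's queue and rejection counts. Just a sketch, assuming Elasticsearch 6+ (where the bulk pool is named "write") and a cluster reachable on localhost:9200 without auth; adjust to your setup:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EsThreadPoolCheck {
    public static void main(String[] args) throws Exception {
        // Cat API for the "write" thread pool, showing active, queued and rejected tasks.
        URI uri = URI.create(
                "http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected");
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // A steadily growing "queue" or non-zero "rejected" while archiving runs
        // suggests archiving and regular output are competing for the same capacity.
        System.out.println(response.body());
    }
}
```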
Thing is, it has worked fine for years like this; it only recently started acting up. The fun thing is, of course, that “nothing has changed” (the irony in that statement is left as an exercise for the reader). I’m hoping the feature I put in for timeouts on rules etc. will also solve the issue, in the sense that we can then pinpoint exactly what’s cooking - or not cooking, for that matter.
If nothing changed on the Graylog server, maybe something changed on the sender side? Is there something sending f*cked-up messages that break your processing?
I would definitely give the pipeline metrics a shot; I don’t see any other useful approach to your problem except trial and error.
Oh, BTW: I just stumbled across Strange pipeline timestamp behaviour, where @gaetan-craft tries to measure log latency. Maybe that would be an interesting approach? Measure the time a message needs to be processed? That could be the last step in your pipeline workflow and might point towards messages that need more processing time than others.
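As a rough illustration of that idea (plain Java, not Graylog's pipeline rule language; the method and field names are made up): the very last processing step computes how long the message has been in flight and stores it as an extra field, so the slow ones become searchable later.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class LatencyStampSketch {

    // Hypothetical "last pipeline step": given the timestamp the message arrived
    // with, record how long it took to get here as an additional field.
    static void stampProcessingDelay(Instant receivedAt, Map<String, Object> fields) {
        long delayMs = Duration.between(receivedAt, Instant.now()).toMillis();
        fields.put("processing_delay_ms", delayMs);
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Object> fields = new HashMap<>();
        Instant receivedAt = Instant.now();
        Thread.sleep(40); // stand-in for the rest of the pipeline doing work
        stampProcessingDelay(receivedAt, fields);
        System.out.println(fields); // e.g. {processing_delay_ms=40}
    }
}
```

In Graylog terms the equivalent would be a final pipeline stage that writes the delta between the message's timestamp and "now" into a field you can then graph or sort on to find the expensive messages.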