Graylog nodes stop outputting/fill up buffers

The original data was full of PII, so I compiled a minimal example.

All this was part of a bigger grok pattern, but the surrounding things are not relevant. The relevant part of the pattern looked like this:

‘(<%{EMAILLOCALPART}@%{HOSTNAME}>[,\s]?)+’

generating this regular expression:

‘(<[a-zA-Z][a-zA-Z0-9_.+-=:]+@\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,100})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,100}))*(\.?|\b)>[,\s]?)+’

This is the relevant extract from the message:

'<asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <0asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid>'

If you guys want to reproduce this, here is the stuff in a small test harness: GitHub - jrunu/java_regex_complexity_poc

There are some interesting points to take away from this. To make this explode in your face, two things are needed:

  1. A good amount of repetition of email addresses
  2. An email address starting with a digit

EMAILLOCALPART assumes that every email address starts with a letter, which is not quite correct. Because of that, the one address starting with a digit makes the whole thing fail to match, and then a whole lot of backtracking ensues. I can’t quite wrap my head around that, because I assumed that the < and > would be anchors enough. But apparently that’s not how things work.
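One plausible reading of why the brackets do not help: at the end of HOSTNAME, `(\.?|\b)` can match the empty string in front of `>` in two ways (an absent dot, or a word boundary), so every address leaves the engine a second path to retry once something later fails. The sketch below is a rough way to watch the work grow from Java. It is not taken from the linked harness: the class and method names are made up, the dots in the pattern are assumed to have been escaped in the original, and `matches()` stands in for the rest of the bigger grok pattern that forces the overall failure. How fast the numbers grow depends on the JDK; as far as I know, newer releases remember failed loop positions for some patterns, so the blow-up may be much milder there than on older runtimes.

```java
import java.util.regex.Pattern;

class BacktrackProbe {
    // Expanded form of (<%{EMAILLOCALPART}@%{HOSTNAME}>[,\s]?)+
    // Assumption: the dots were escaped in the original pattern.
    static final Pattern ADDRESSES = Pattern.compile(
        "(<[a-zA-Z][a-zA-Z0-9_.+-=:]+@\\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,100})"
        + "(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,100}))*(\\.?|\\b)>[,\\s]?)+");

    /** Hypothetical helper: counts charAt() calls as a rough proxy for regex work. */
    static final class CountingSequence implements CharSequence {
        private final CharSequence delegate;
        long reads;

        CountingSequence(CharSequence delegate) {
            this.delegate = delegate;
        }

        @Override
        public char charAt(int index) {
            reads++;
            return delegate.charAt(index);
        }

        @Override
        public int length() {
            return delegate.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            return delegate.subSequence(start, end);
        }

        @Override
        public String toString() {
            return delegate.toString();
        }
    }

    /** n valid addresses followed by one whose local part starts with a digit. */
    static String input(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("<asdf@inv.alid> ");
        }
        return sb.append("<0asdf@inv.alid>").toString();
    }

    /** Number of characters the engine read before giving up on input(n). */
    static long reads(int n) {
        CountingSequence seq = new CountingSequence(input(n));
        if (ADDRESSES.matcher(seq).matches()) {
            throw new IllegalStateException("expected the match to fail");
        }
        return seq.reads;
    }

    public static void main(String[] args) {
        // Keep n small: on an engine that backtracks exponentially,
        // each extra address roughly doubles the work.
        for (int n = 2; n <= 12; n += 2) {
            System.out.println(n + " addresses: " + reads(n) + " character reads");
        }
    }
}
```

If the read count roughly doubles with every added address, you are looking at the exponential case; if it grows in step with the input length, your runtime is handling this particular pattern gracefully.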

Besides fixing EMAILLOCALPART, a way to reduce the amount of backtracking is to make the outermost group an atomic group, i.e. a group that is treated as “one” when backtracking:

‘(?><%{EMAILLOCALPART}@%{HOSTNAME}>[,\s]?)+’

In this example it then only tries to backtrack, I think, 81 times instead of a metric fuckton of times.
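To make "treated as one" a bit more concrete, here is a small Java sketch. The first pair of patterns is a toy case showing that an atomic group never gives back what it has consumed; the second pair applies the same idea to the expanded address pattern. The class and helper names are invented for illustration, and the escaped dots in the expanded pattern are an assumption.

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    // One address plus optional separator (dots assumed escaped in the original).
    static final String ADDRESS =
        "<[a-zA-Z][a-zA-Z0-9_.+-=:]+@\\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,100})"
        + "(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,100}))*(\\.?|\\b)>[,\\s]?";

    static final Pattern CAPTURING = Pattern.compile("(" + ADDRESS + ")+");
    static final Pattern ATOMIC = Pattern.compile("(?>" + ADDRESS + ")+");

    static boolean full(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Toy case: the capturing group hands one 'a' back so that "ab" can match;
        // the atomic group keeps all three and the match fails.
        System.out.println(full(Pattern.compile("(a+)ab"), "aaab"));   // true
        System.out.println(full(Pattern.compile("(?>a+)ab"), "aaab")); // false

        String good = "<asdf@inv.alid> <asdf@inv.alid>";
        String bad = "<asdf@inv.alid> <0asdf@inv.alid>";

        // Both variants agree on the outcome; the atomic one simply
        // refuses to re-enter an address it has already matched.
        System.out.println(full(CAPTURING, good) + " " + full(ATOMIC, good)); // true true
        System.out.println(full(CAPTURING, bad) + " " + full(ATOMIC, bad));   // false false
    }
}
```

One thing to keep in mind: because the atomic group also commits to the optional trailing separator, whatever follows it in the bigger pattern must not depend on getting that comma or whitespace back.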

And of course using

‘(<[^>]+>[,\s]?)+’

instead would have worked just as well and faster.
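A minimal Java sketch of that simpler pattern, assuming the goal is just to recognise the bracketed list and pull out what is inside each pair of brackets. The class name and the second pattern `<([^>]+)>` are illustrative additions, not part of the original setup. Note the trade-off: this accepts anything between `<` and `>`, so it no longer checks that the content looks like an email address.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SimpleAddressList {
    // The simplified list pattern from the post.
    static final Pattern LIST = Pattern.compile("(<[^>]+>[,\\s]?)+");

    // Illustrative helper pattern: one bracketed item, content in group 1.
    static final Pattern ITEM = Pattern.compile("<([^>]+)>");

    /** Collects the text between each pair of angle brackets. */
    static List<String> extract(String input) {
        List<String> items = new ArrayList<>();
        Matcher m = ITEM.matcher(input);
        // A repeated capturing group only keeps its last iteration,
        // so the items are walked one by one with find().
        while (m.find()) {
            items.add(m.group(1));
        }
        return items;
    }

    public static void main(String[] args) {
        String sample = "<asdf@inv.alid> <0asdf@inv.alid>";
        System.out.println(LIST.matcher(sample).matches()); // true
        System.out.println(extract(sample)); // [asdf@inv.alid, 0asdf@inv.alid]
    }
}
```

Since `[^>]+` can never run past the closing bracket, there is only one way to match each item and nothing for the engine to retry.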
