The original data was full of PII, so I compiled a minimal example.
All this was part of a bigger grok pattern, but the surrounding things are not relevant. The relevant part of the pattern looked like this:
‘(<%{EMAILLOCALPART}@%{HOSTNAME}>[,\s]?)+’
generating this regular expression:
‘(<[a-zA-Z][a-zA-Z0-9_.+-=:]+@\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,100})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,100}))*(\.?|\b)>[,\s]?)+’
This is the relevant extract from the message:
'<asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <0asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid> <asdf@inv.alid>'
If you guys want to reproduce this, I put it all into a small test harness on GitHub: jrunu/java_regex_complexity_poc
There are some interesting points to take away from this. To make this explode in your face, two things are needed:
- A good amount of repetition of email addresses
- An email address starting with a digit
EMAILLOCALPART assumes that every email address starts with a letter, which is not quite correct. That makes the whole thing fail to match, and then a whole lot of backtracking ensues. I can’t quite wrap my head around that, because I assumed that the < and > would be anchors enough. But apparently that’s not how things work.
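To make the two conditions a bit more tangible, here is a minimal Java sketch (not the code from the test harness). It assumes the sub-patterns are as quoted above, with `+-` in the local-part character class and escaped dots in the hostname part. It uses `matches()` as a stand-in for the rest of the grok pattern, which forces the engine to get past the whole address list. The `CountingSequence` wrapper and `buildInput` are hypothetical helpers of mine; the number of `charAt` calls is only a rough proxy for how much work the engine does, and the numbers will differ between regex engines and versions. My reading of one likely source of the blow-up: the tail of the hostname part, `(\.?|\b)`, can match the empty string in front of `>` in two different ways, so every address that has already matched gives the engine two paths to retry once the list fails later on.

```java
import java.util.regex.Pattern;

final class BacktrackingDemo {
    // Expanded sub-patterns as quoted in the post (assumed: "+-" in the class, escaped dots).
    static final String EMAILLOCALPART = "[a-zA-Z][a-zA-Z0-9_.+-=:]+";
    static final String HOSTNAME =
        "\\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,100})(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,100}))*(\\.?|\\b)";
    static final Pattern LIST =
        Pattern.compile("(<" + EMAILLOCALPART + "@" + HOSTNAME + ">[,\\s]?)+");

    /** Hypothetical helper: counts charAt() calls as a rough measure of matcher work. */
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;
        CountingSequence(String text) { this.text = text; }
        @Override public int length() { return text.length(); }
        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) { return text.subSequence(start, end); }
        @Override public String toString() { return text; }
    }

    /** Hypothetical helper: validCount good addresses, then one starting with a digit. */
    static String buildInput(int validCount) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < validCount; i++) {
            sb.append("<asdf@inv.alid> ");
        }
        return sb.append("<0asdf@inv.alid>").toString();
    }

    public static void main(String[] args) {
        System.out.println(Pattern.matches(EMAILLOCALPART, "asdf"));  // true
        System.out.println(Pattern.matches(EMAILLOCALPART, "0asdf")); // false

        // Inputs are kept small on purpose so this finishes quickly on any engine.
        for (int n = 2; n <= 14; n += 4) {
            CountingSequence input = new CountingSequence(buildInput(n));
            boolean matched = LIST.matcher(input).matches();
            System.out.println(n + " valid addresses: matched=" + matched
                + ", charAt calls=" + input.reads);
        }
    }
}
```

If the call count grows much faster than the input length as you add addresses, that is the backtracking at work. I would not go much beyond a couple of dozen addresses without a timeout.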
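Besides fixing EMAILLOCALPART, a way to reduce the amount of backtracking is to make the outermost group an atomic group, i.e. a group that is treated as “one” when backtracking: once it has matched, the engine does not go back into it to try other ways of matching the same text.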
‘(?><%{EMAILLOCALPART}@%{HOSTNAME}>[,\s]?)+’
In this example it then only tries to backtrack, I think, 81 times instead of a metric fuckton of times.
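For comparison, here is a small Java sketch that runs the capturing and the atomic variant side by side. It is an illustration under the same assumptions as before: the sub-patterns are the expansions quoted in this post, `matches()` stands in for the rest of the grok pattern, and the `charAt` counter and `buildInput` are hypothetical helpers, not anything grok provides. I am deliberately not claiming specific numbers, since they depend on the engine.

```java
import java.util.regex.Pattern;

final class AtomicGroupDemo {
    static final String HOSTNAME =
        "\\b(?:[0-9A-Za-z][0-9A-Za-z-]{0,100})(?:\\.(?:[0-9A-Za-z][0-9A-Za-z-]{0,100}))*(\\.?|\\b)";
    static final String ADDRESS =
        "<[a-zA-Z][a-zA-Z0-9_.+-=:]+@" + HOSTNAME + ">[,\\s]?";

    // Same body, once as a normal capturing group and once as an atomic group.
    static final Pattern CAPTURING = Pattern.compile("(" + ADDRESS + ")+");
    static final Pattern ATOMIC = Pattern.compile("(?>" + ADDRESS + ")+");

    /** Hypothetical helper: counts charAt() calls as a rough measure of matcher work. */
    static final class CountingSequence implements CharSequence {
        private final String text;
        long reads = 0;
        CountingSequence(String text) { this.text = text; }
        @Override public int length() { return text.length(); }
        @Override public char charAt(int index) { reads++; return text.charAt(index); }
        @Override public CharSequence subSequence(int start, int end) { return text.subSequence(start, end); }
        @Override public String toString() { return text; }
    }

    /** Hypothetical helper: validCount good addresses, then one starting with a digit. */
    static String buildInput(int validCount) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < validCount; i++) {
            sb.append("<asdf@inv.alid> ");
        }
        return sb.append("<0asdf@inv.alid>").toString();
    }

    /** Runs a full match attempt and returns how many charAt() calls it took. */
    static long reads(Pattern pattern, String text) {
        CountingSequence input = new CountingSequence(text);
        pattern.matcher(input).matches();
        return input.reads;
    }

    public static void main(String[] args) {
        for (int n = 4; n <= 12; n += 4) {
            String text = buildInput(n);
            System.out.println(n + " valid addresses: capturing=" + reads(CAPTURING, text)
                + " charAt calls, atomic=" + reads(ATOMIC, text) + " charAt calls");
        }
    }
}
```

One thing to keep in mind with the atomic variant: each iteration also keeps the trailing separator it consumed via `[,\s]?`, so if whatever follows in the bigger pattern needs that whitespace, the atomic version may stop matching where the capturing one did.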
And of course using
‘(<[^>]+>[,\s]?)+’
instead would have worked just as well and faster.
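A quick Java sketch of that simpler pattern, just to show what it does and does not do. The `ENTRY` pattern and the `entries` method are hypothetical additions of mine for pulling out the individual addresses; they are not part of the grok pattern above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SimpleListDemo {
    // The simplified list pattern from the post.
    static final Pattern LIST = Pattern.compile("(<[^>]+>[,\\s]?)+");
    // Hypothetical helper pattern (not from the post): captures the text between < and >.
    static final Pattern ENTRY = Pattern.compile("<([^>]+)>");

    /** Returns everything found between angle brackets, in order of appearance. */
    static List<String> entries(String text) {
        List<String> result = new ArrayList<>();
        Matcher m = ENTRY.matcher(text);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "<asdf@inv.alid> <0asdf@inv.alid> <asdf@inv.alid>";
        System.out.println(LIST.matcher(text).matches());              // true
        System.out.println(entries(text));   // [asdf@inv.alid, 0asdf@inv.alid, asdf@inv.alid]
        System.out.println(LIST.matcher("<not an email>").matches());  // true
    }
}
```

The trade-off is that this no longer validates anything: whatever sits between < and > is accepted, so if you need real address validation it has to happen in a later step.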