RegEx Tools vs Graylog Matching

Hi all,

A simple question I am struggling with:
I use regex101.com to test the regex strings that I am creating. I get matches and all is good. regex101.com also provides output that shows all the matches as groups.
Example:

If there is a match, the matched parts of the string are broken up into what are called “Groups”. Now, in a Graylog pipeline rule one has something like the following (snippet):

rule "Extract Snort alert fields"
when
  has_field("message")
then
  let m = regex("\\[Classification: (.+?)\\] \\[Priority: (\\d+)\\] \\{(.+?)\\} (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})(:(\\d{1,5}))? -> (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})(:(\\d{1,5}))?", to_string($message.message));
  set_field("classification", m["0"]); // set classification of the log entry
  set_field("Priority:", m["1"]); // set the priority of the log entry

When you refer to the results of the regex, e.g. m["1"] or m["7"], to set them as fields in Graylog, the indexes don’t seem to match the group numbers on regex101.

Question:
Is there a way to debug the “groups” as Graylog sees them, so that one can correctly assign them to fields?

The reason I ask: there is one value, in my case src_ip, that I cannot find no matter what index I use in m["<value>"]. The regex is finding dst_ip (which comes after the value I want, src_ip), but I cannot find the src_ip value.

Ideally, I would like to see what is in the array m (from my code below).

Rule:

rule "Extract Snort alert fields"
when
  has_field("message")
then
  let m = regex("\\[Classification: (.+?)\\] \\[Priority: (\\d+)\\] \\{(.+?)\\} (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})(:(\\d{1,5}))? -> (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})(:(\\d{1,5}))?", to_string($message.message));

  set_field("snort_alert", true); // set snort alert
  set_field("application_name", "snort-alerts"); // set application name
  set_field("classification", m["0"]); // set classification of the log entry
  set_field("Priority:", m["1"]); // set the priority of the log entry
  set_field("Protocol", m["2"]); // which protocol was this log entry received on
  set_field("src_addr:", m["3"]); // source address of the log entry suspected address
  set_field("lng_src_port", m["4"]); // residue from regex on the source port
  let dp =  m["4"]; //setting variable for later regex
  set_field("Check dp",dp); // setting field as a debug
  set_field("src_port", m["5"]); // set the source port the traffic is coming from
  set_field("dst_addr", m["6"]); // what is the destination address that the traffic is going to
  set_field("lng_dst_port", m["7"]); // residue from original regex for destination port
  set_field("dst_addr", m["11"]); // destination address for the log entry
  let dp_port = regex_replace("[:]", to_string(m["7"]), ""); // try remove the leading : in the destination port
  set_field("dst_port",dp_port); // set the destination port from the dp_port variable 
  set_field("src_addrs:", m["3"]); // source address of the log entry suspected address
end

I am sure I am doing something wrong. However, if it is not a code issue, how can I see what is stored in m?

As regex101 does not use Java regex, it is not the best tool to test your regex, because you need to double-escape in your Graylog/Java regex to make it work. So you might end up with something like:

regex("\\\[Classification: (.+?)\\\] \\\[Priority: (\\\d+)\\\] \\\{(.+?)\\\} (\\\d{1,3}\\\.\\\d{1,3}\\\.\\\d{1,3}\\\.\\\d{1,3})(:(\\\d{1,5}))? -> (\\\d{1,3}\\\.\\\d{1,3}\\\.\\\d{1,3}\\\.\\\d{1,3})(:(\\\d{1,5}))?", to_string($message.message));

This is not tested; I just wrote it down to make the point clear.

 let m = regex("\\[Classification: (.+?)\\] \\[Priority: (\\d+)\\] \\{(.+?)\\} (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})(:(\\d{1,5}))? -> (\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})(:(\\d{1,5}))?", to_string($message.message));

Does not work - in your example:
regex("\\\[Classification: (.+?)\\\] \\\[Priority: (\\\d+)\\\] \\\{(.+?)\\\} (\\\d{1,3}\\\.\\\d{1,3}\\\.\\\d{1,3}\\\.\\\d{1,3})(:(\\\d{1,5}))? -> (\\\d{1,3}\\\.\\\d{1,3}\\\.\\\d{1,3}\\\.\\\d{1,3})(:(\\\d{1,5}))?", to_string($message.message));

You don’t escape with \\\, you escape with \\, as per my code. The code works; the question is more about not being able to find one specific field even though I can find all the others. The field in question is an IP address. As per the code above, I don’t do any special datatype conversion for it, just as with dst_ip (which works).
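To make the escaping point concrete, here is a minimal Java sketch. It assumes that string literals in a Graylog rule treat backslashes the same way Java source does, which the working rule above suggests but which is not confirmed in this thread. The class name and sample text are made up for illustration.

```java
import java.util.regex.Pattern;

class EscapeLevels {
    // What you would type into a regex tester:   \[Priority: (\d+)\]
    // What goes between the quotes in source code: every backslash doubled.
    static final String IN_SOURCE = "\\[Priority: (\\d+)\\]";

    // True if the pattern is found anywhere in the text.
    static boolean matches(String text) {
        return Pattern.compile(IN_SOURCE).matcher(text).find();
    }

    public static void main(String[] args) {
        // Prints the pattern as the regex engine receives it, with single backslashes.
        System.out.println(IN_SOURCE);
        // Made-up sample text, not taken from a real log.
        System.out.println(matches("[Priority: 2]"));
    }
}
```

So the pattern copied from a tester needs each backslash doubled once, not tripled, when it is placed inside the quoted string.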

That is why the original question was whether there is a way to see what Graylog returns from the regex function, so I can see at which index in the array src_ip sits. So far I have tried indexes up to 14, even though regex101 reports only 9, and I still can’t find the src_ip value. Any other ideas?

You can use debug() in your rule to see the results in the Graylog log file:

tail -f /var/log/graylog-server/server.log

In your pipeline rule you can put in:

then
...    
  debug("---results of regex:");
  debug(to_string(m)); //to look at all parts in {}
  debug(to_string(m["12"])); //if you want to pull out item 12
...
end

regex101 shows the entire match as item ["0"] and the first capture as ["1"]. Graylog, on the other hand, ignores the entire match and treats the first capture as ["0"] (in my experience).
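The numbering difference can be reproduced outside Graylog. The sketch below is plain Java, not Graylog code: it runs the same pattern with java.util.regex and re-keys the captures from zero, mimicking the behaviour described above. The class name, the helper zeroBasedGroups and the sample line (with documentation-range IPs) are all made up for illustration, and skipping unmatched optional groups is my own choice, not something confirmed about Graylog.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumbering {
    // Same pattern as in the rule, split across lines for readability.
    static final Pattern SNORT = Pattern.compile(
        "\\[Classification: (.+?)\\] \\[Priority: (\\d+)\\] \\{(.+?)\\} "
        + "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})(:(\\d{1,5}))? -> "
        + "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})(:(\\d{1,5}))?");

    // Java numbers the whole match as group 0 and captures from 1.
    // Here the captures are re-keyed as "0", "1", ... so the first capture is "0".
    static Map<String, String> zeroBasedGroups(String line) {
        Map<String, String> out = new LinkedHashMap<>();
        Matcher m = SNORT.matcher(line);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                if (m.group(i) != null) {          // optional groups may not take part in the match
                    out.put(String.valueOf(i - 1), m.group(i));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample line; the IPs come from documentation ranges.
        String line = "[Classification: Attempted Information Leak] [Priority: 2] "
            + "{UDP} 192.0.2.10:53 -> 198.51.100.7:61348";
        zeroBasedGroups(line).forEach((k, v) -> System.out.println("m[\"" + k + "\"] = " + v));
    }
}
```

Note that each (:(\d{1,5}))? contributes two groups, one with the colon and one without, which is why the pattern has nine captures in total and the source address lands at key "3" under zero-based numbering.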

That is really cool and it works well (had to fix one typo and add two )).
What is really interesting now…

If we look at the debug:

2020-01-07T18:16:59.666+02:00 INFO  [Function] PIPELINE DEBUG: {0=Attempted Information Leak, 1=2, 2=UDP, 3=x.x.x.x, 4=:53, 5=53, 6=x.x.x.x, 7=:61348, 8=61348}

(replaced IPs with x and y)
It’s saying 3, which is the one I want for src_ip. However, if we look at my code below:

set_field("src_addr:", m["3"]); // source address of the log entry suspected address

I am setting m["3"] to the right field, yet it is not shown among the values in my pipeline. What’s even more perplexing is that I do get the src_port, which is split off just after the src_addr.

Confused here, any thoughts? Thanks for that tip, it is a very useful one!

It could be that the comma makes it a different field type than expected, if the regex captures it. Or use to_ip() to ensure IP formatting.

3=x.x.x.x,

You could try:

replace(m["3"], ","); //removes just the comma

and/or:

set_field("src_addr:", to_ip(m["3"])); //not sure this would clean up the comma alone...

[corrected my other typo]

Thank you very much once again.

So the replace seems to work: if I add a debug for m["3"], I can see the right IP address. Yet still no src_addr is shown when I inspect the stream, and I am not sure what the issue is. We have at least got to the point where we can see the right value at m["3"].

Some more testing. Note that I could just be reaching here; I have been looking at this so long that I am getting confused.

I tried to set a variable to the value from m["3"]:
let scr = m["3"];

Then print it out in the debug:
debug(scr);

Interestingly, it prints: Passed value is NULL.

Excuse the formatting of this post; for some reason, on my iPad the text box is not allowing me to select text and mark it as code.

Anything in debug() needs to be a string so it would be:

let scr = to_string(m["3"]); 
debug(scr);

which is a longer way to write

debug(to_string(m["3"]));

… just spotted… I think the colon is a reserved character; you should not include it in field names…

set_field("src_addr:", m["3"]);

Definitely cannot have spaces in the field name either… .

set_field("Check dp",dp);
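The two field-name points above can be checked mechanically before a rule is deployed. The sketch below is plain Java and rests on an assumption not confirmed in this thread: that a valid field name consists only of letters, digits, underscores, dots, hyphens and @. The class, the pattern and the helper looksValid are hypothetical and are not part of any Graylog API.

```java
import java.util.regex.Pattern;

class FieldNameCheck {
    // ASSUMPTION: allowed characters are letters, digits, underscore, dot, hyphen and @.
    // This is an illustrative guess at the rule, not taken from Graylog's documentation.
    static final Pattern ASSUMED_VALID = Pattern.compile("^[\\w.\\-@]+$");

    // Hypothetical helper: true if the name uses only the assumed-valid characters.
    static boolean looksValid(String name) {
        return ASSUMED_VALID.matcher(name).matches();
    }

    public static void main(String[] args) {
        // Field names taken from the rule in this thread, plus one cleaned-up variant.
        String[] names = {"src_addr:", "Priority:", "Check dp", "src_addr", "dst_port"};
        for (String name : names) {
            System.out.println("\"" + name + "\" -> " + (looksValid(name) ? "ok" : "suspicious"));
        }
    }
}
```

Under that assumption, the names with a trailing colon and the name containing a space are flagged, while src_addr and dst_port pass.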

Finally … thank you very much, I was racking my brain. It seems to have been the damn “:” reserved character. I would have thought that, being inside a string, it would not have mattered. I should also have picked it up, as it was the only field where I had the : after the field name…

Thank you once again, I learnt a lot!
