One complex grok vs. a set of regex extractors

Hi, I have NetScreen firewall logs flowing into Graylog, and I am looking into extracting ~8 fields (such as source and destination IPs, source and destination zones, device name, etc.). The logs look like the following:
isg1000-A2: NetScreen device_id=0000011001000011 [Root]system-notification-00257(traffic): start_time="2015-11-11 10:02:10" duration=0 policy_id=244 service=https proto=6 src zone=Untrust dst zone=Trust action=Permit sent=0 rcvd=0 src= dst= src_port=1732 dst_port=443 src-xlated ip= port=22041 dst-xlated ip= port=443 session_id=488451 reason=Creation

One possibility, which I have tested and found to work, is to use the following grok extractor (from the Marketplace):
start_time=%{DATA:start_time} duration=%{INT:duration} policy_id=%{INT:policy_id} service=%{DATA:service} proto=%{INT:proto} src zone=%{WORD:src_zone} dst zone=%{WORD:dst_zone} action=%{WORD:action} sent=%{INT:sent} rcvd=%{INT:rcvd} src=%{IP:src_ip} dst=%{IP:dst_ip} src_port=%{INT:src_port} dst_port=%{INT:dst_port} src-xlated ip=%{IP:scr-xlated_ip} port=%{INT:src-xlated_port} dst-xlated ip=%{IP:dst-xlated_ip} port=%{INT:dst-xlated_port} session_id=%{INT:session_id} reason=%{GREEDYDATA:reason}

Since most of the fields are key=value pairs, another method would be to create a regex extractor for each field I want, for a total of 8 regex extractors, for example :

Which way is better from a performance perspective? One complex grok or 8 "light" regex extractors?

Thanks in advance.

I'm not sure which option performs better, but I wanted to point out one thing about the grok patterns. You don't need a complex grok pattern like the one above. In fact, I would advise against it: depending on the vendor, the number of elements in a syslog message can change from one message to the next.

For example, your message has 20 elements. If NetScreen always sends all 20 elements with every message, then it's probably not too bad. If it doesn't, you'll need a complex grok pattern for every variation of the message, plus a way to tell Graylog which pattern to apply based on some criteria. Otherwise it will try every complex grok pattern on every incoming message, and if a pattern doesn't match EXACTLY, it won't extract anything, which is not what you want.

To solve this, you could use small groups or individual extractors that run on every message, or only on certain messages based on a condition.

For example:

start_time=%{DATA:start_time} duration=%{INT:duration}
runs against every message

src=%{IP:src_ip} dst=%{IP:dst_ip} src_port=%{INT:src_port} dst_port=%{INT:dst_port}
runs only on messages that contain src=
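
To make the difference concrete, here is a rough sketch in plain Python regex rather than Graylog itself. The shortened log lines, the pattern names and the two helper functions are all made up for illustration; the point is only that a single all-in-one pattern extracts nothing when one element is missing, while small per-field patterns still return whatever is present.

```python
import re

# Hypothetical, shortened log lines modelled on the sample in this thread.
FULL = 'start_time="2015-11-11 10:02:10" duration=0 policy_id=244 service=https'
PARTIAL = 'start_time="2015-11-11 10:02:10" policy_id=244 service=https'  # no duration=

# One monolithic pattern: every element must be present, in this exact order.
MONOLITHIC = re.compile(
    r'start_time="(?P<start_time>[^"]*)" duration=(?P<duration>\d+) '
    r'policy_id=(?P<policy_id>\d+) service=(?P<service>\S+)'
)

# One small pattern per field: each one succeeds or fails on its own.
PER_FIELD = {
    "start_time": re.compile(r'start_time="([^"]*)"'),
    "duration": re.compile(r'\bduration=(\d+)'),
    "policy_id": re.compile(r'\bpolicy_id=(\d+)'),
    "service": re.compile(r'\bservice=(\S+)'),
}


def extract_monolithic(line):
    """Return all fields, or an empty dict if the line deviates at all."""
    match = MONOLITHIC.search(line)
    return match.groupdict() if match else {}


def extract_per_field(line):
    """Return every field whose own small pattern matches the line."""
    fields = {}
    for name, pattern in PER_FIELD.items():
        match = pattern.search(line)
        if match:
            fields[name] = match.group(1)
    return fields


if __name__ == "__main__":
    print(extract_monolithic(PARTIAL))
    print(extract_per_field(PARTIAL))
```

The trade-off is that the per-field version scans the message once per pattern, so it buys robustness rather than speed.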

Another option is to have one grok pattern per element:

runs against every message

runs against every message


It's also worth pointing out that, since you only want 8 fields, your grok pattern will extract 20, which will use more storage, take longer to process, etc. You could remedy that by extracting named captures only and leaving the fields you don't care about unnamed.

start_time=%{DATA:start_time} - Named capture
start_time=%{DATA} - unnamed capture
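
The same idea in plain regex terms, as a small Python sketch (the sample line and the choice of which fields to keep are assumptions for illustration): a named group is stored as a field, while a non-capturing group still has to match but produces nothing.

```python
import re

# Hypothetical fragment of the log line from this thread.
LINE = "duration=0 policy_id=244 service=https proto=6"

PATTERN = re.compile(
    r"duration=(?:\d+) "              # must match, but is not stored
    r"policy_id=(?P<policy_id>\d+) "  # stored as field "policy_id"
    r"service=(?P<service>\S+) "      # stored as field "service"
    r"proto=(?:\d+)"                  # must match, but is not stored
)

# groupdict() returns only the named groups.
fields = PATTERN.search(LINE).groupdict()

if __name__ == "__main__":
    print(fields)
```

Note that the unnamed parts are still matched, so this mainly saves storage, not matching work.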

hth… good luck

Very helpful indeed, thank you.

But the question about performance stands, i.e. if I can configure either grok or regex to extract one element, which one is less greedy on CPU?

start_time=%{DATA:start_time} Vs .+\sstart_time=(\S+)\s

I’m no expert in the arena, but grok patterns are mainly canned RegEx, so based simply on that, I would assume that a well formed RegEx will always outperform a Grok pattern.
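
To illustrate the "canned regex" point, here is a minimal Python sketch of how a grok expression expands into an ordinary regex. The helper grok_to_regex is hypothetical and is not Graylog's implementation; the four definitions are the usual ones for DATA, GREEDYDATA, INT and WORD as far as I know them, and the sample line uses straight quotes, which is an assumption about the raw log.

```python
import re

# Small subset of common grok definitions (assumed, not exhaustive).
GROK = {
    "DATA": r".*?",
    "GREEDYDATA": r".*",
    "INT": r"(?:[+-]?(?:[0-9]+))",
    "WORD": r"\b\w+\b",
}

# Matches %{NAME} and %{NAME:field}.
TOKEN = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def grok_to_regex(pattern):
    """Hypothetical helper: expand grok tokens into plain regex syntax."""
    def expand(match):
        body = GROK[match.group(1)]
        field = match.group(2)
        # A named token becomes a named group, an unnamed one a non-capturing group.
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return TOKEN.sub(expand, pattern)


LINE = 'traffic): start_time="2015-11-11 10:02:10" duration=0 policy_id=244'

expanded = grok_to_regex("start_time=%{DATA:start_time} duration=%{INT:duration}")
grok_fields = re.search(expanded, LINE).groupdict()

# The hand-written regex from the question, for comparison.
handwritten = re.search(r".+\sstart_time=(\S+)\s", LINE).group(1)

if __name__ == "__main__":
    print(expanded)
    print(grok_fields)
    print(handwritten)
```

One thing this shows: on a line like this, the hand-written regex stops at the first space, so it captures only the date part of start_time. Something like start_time="([^"]*)" would be needed to get the whole timestamp.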

Hey @cawfehman

I would assume that a well formed RegEx will always outperform a Grok pattern.

That is 100% true

@H2Cyber if you are able to create a simple regex that extracts the information you need, use that in favour of a grok pattern. Grok patterns are mostly built to capture anything possible and are not very specific.
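
If you want to measure this rather than assume it, a rough timing sketch in Python could look like the one below. Keep in mind that Graylog runs on Java, whose regex engine differs from Python's, so the numbers only hint at the general shape and are not a Graylog benchmark. The two patterns, the helper time_pattern and the straight quotes in the sample line are assumptions for illustration.

```python
import re
import timeit

LINE = (
    "isg1000-A2: NetScreen device_id=0000011001000011 "
    "[Root]system-notification-00257(traffic): "
    'start_time="2015-11-11 10:02:10" duration=0 policy_id=244 service=https'
)

# Grok-style: lazy "match anything" up to the next key.
GROK_STYLE = re.compile(r"start_time=(?P<start_time>.*?) duration=")

# Specific: only what can legally appear between the quotes.
SPECIFIC = re.compile(r'start_time="(?P<start_time>[^"]*)"')


def time_pattern(pattern, line=LINE, number=20000):
    """Return the seconds taken by `number` searches of `line`."""
    return timeit.timeit(lambda: pattern.search(line), number=number)


if __name__ == "__main__":
    print("grok-style:", time_pattern(GROK_STYLE))
    print("specific:  ", time_pattern(SPECIFIC))
```

Run it against a representative sample of your own messages, since the result depends heavily on message length and on where in the line the field sits.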
