Best practice - unwanted field extraction

Ric · January 3, 2022, 3:24pm

I’m finding myself in a situation in which the default Graylog extractor is working… a bit too well.

An email security appliance is sending logs in the format:

...client_ip=x.x.x.x client_cc="US" dst_ip=y.y.y.y from="1axcflbq9qlakhml8acdliyy977alidg1uwdb9-k+bob=customdomain.com@bf07x.hubspotemail.net" to="bob@customdomain.com" polid="0:1:2:SYSTEM" domain="customdomain.com"...

Now, Graylog is doing an amazing job populating the different fields with the correct information, with one exception: the from field contains a = and this will end up with an additional, unwanted field.

In the previous example:

expected field from value 1axcflbq9qlakhml8acdliyy977alidg1uwdb9-k+bob=customdomain.com@bf07x.hubspotemail.net
unwanted field bob value customdomain.com@bf07x.hubspotemail.net

Considering bob is a username, this will rapidly cause an explosion in the number of fields, causing indexing failures as soon as the Elasticsearch field limit is reached for that index.

The questions I couldn’t answer are:

What is the best practice in a scenario like this?
Can I effectively correct the behavior of the default extractor, avoiding the double extraction in the first place?
If not, can I somehow efficiently remove those fields, whose name I cannot know in advance?

I’m running Graylog 4.2.4

Thanks a lot!
Riccardo

tmacgbay · January 3, 2022, 5:37pm

I am not sure how you would sequence this with extractors (@gsmith), but I have handled something similar before in pipeline rules. I set the message to a variable, pulled out the challenging data with regex, and assigned it to a field. It is a little repetitive, but you can then use regex_replace() to clear that data from the variable, and use key_value() to pull the rest of the fields out. Here is a sample regex that would let you pull just the "from" data from your message:

(?<=from=\")(.+?)(?=\")

gsmith · January 3, 2022, 11:46pm

Hello,

Could you describe your configurations (type of input, type of devices sending logs, extractors, etc…)?
Second question: do you want to patch this or completely fix this issue?

Ric · January 4, 2022, 10:24am

Hi, thanks for the replies!

The current configuration is a Syslog UDP input, receiving FortiMail-related logs from a FortiAnalyzer appliance. At the moment there are no custom extractors active on that input.

I realized the from field is the most common, but not the only, field that can trigger the issue, which in the end comes down to the way the appliance formats its messages.

Specifically, there’s a mixture of standards in the message:

field=value
field="value"

It looks like the default extractor attempts to extract values using both formats, succeeding in 99% of cases but also causing those unwanted fields to be indexed.

Here’s a full example:

date=2021-12-30 time=16:02:17 logver=700020177 timestamp=1640883737 devname="smtp" devid="FEVM0200000****" vd="root" itime=1640883738 logver=0700020177 time="18:02:19.158" devname="smtp" device_id="FEVM0200000****" log_id="0003027281" type="event" subtype="smtp" pri="information" user="mail" ui="mail" action="NONE" status="N/A" session_id="1BUH2I6a027278-1BUH2I6c027278" msg="to=<support-message@domain.com>, delay=00:00:01, xdelay=00:00:01, mailer=esmtp, pri=1561532, relay=[x.x.x.x\] [y.y.y.y\], dsn=2.0.0, stat=Sent (<CA+a2MrYQaauH+0=jL3zoRioD8n9NQMvU3z3iCRnaaTAa8AE3bg@mail.domain.com> [InternalId=46909632807007, Hostname=xyz.domain.local\] 1462256 bytes in 0.203, 7018.295 KB/sec Queued mail for delivery)" tz="-0100"

In this example:

the date=... field will be correctly parsed
the msg="..." field will be correctly parsed
a whole lot of unwanted fields will be indexed from the content of the msg field, e.g. 0 = jL3zoRioD8n9NQMvU3z3iCRnaaTAa8AE3bg@mail.domain.com
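The behavior described above can be reproduced outside Graylog with a small Python sketch. This is only a model of the hypothesis that both formats are matched independently; it is not Graylog's actual extractor code, and `two_pass`, `single_pass`, and the three patterns are hypothetical names. The log line is a shortened version of the full example.

```python
import re

# Shortened version of the example line above (same structure, fewer fields).
line = (
    'date=2021-12-30 time=16:02:17 type="event" '
    'msg="to=<support-message@domain.com>, delay=00:00:01, dsn=2.0.0, '
    'stat=Sent (<CA+a2MrYQaauH+0=jL3zoRioD8n9NQMvU3z3iCRnaaTAa8AE3bg@mail.domain.com>)" '
    'tz="-0100"'
)

BARE = re.compile(r'(\w+)=([^\s"]+)')    # field=value
QUOTED = re.compile(r'(\w+)="([^"]*)"')  # field="value"
# One left-to-right pass: a quoted value is consumed whole before scanning resumes.
SINGLE_PASS = re.compile(r'(?<!\S)(\w+)=(?:"([^"]*)"|(\S+))')


def two_pass(text):
    """Hypothetical model: apply both patterns independently and merge the results."""
    fields = dict(BARE.findall(text))
    fields.update(QUOTED.findall(text))
    return fields


def single_pass(text):
    """Quote-aware alternative: each key must start a token, quoted values stay whole."""
    return {key: (quoted if bare == "" else bare)
            for key, quoted, bare in SINGLE_PASS.findall(text)}


print(sorted(two_pass(line)))
print(sorted(single_pass(line)))
```

In this model the two-pass variant picks up `to`, `delay`, `dsn`, `stat`, and `0` from inside msg, while the single-pass variant yields only `date`, `time`, `type`, `msg`, and `tz`. A custom extractor or pipeline rule would need the same property: never scan inside a value that has already been matched as quoted.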

@gsmith I'm not sure what you mean by patch or completely fix. My main goal would be to understand whether there's any best practice in this scenario, and which options might help solve the issue, if any…

Thanks again!

gsmith · January 4, 2022, 10:34pm

Hello,

I understand now. Have you tried using a Raw/Plaintext TCP/UDP input? Perhaps that would minimize the number of fields being generated.

For example:
I have FortiGate firewalls (60E, 100, 200) in my environment, so I created a Raw/Plaintext input. On that input I created ONLY the fields I need for alerts and notifications.
I have tried Syslog UDP, but it didn't work well in our environment.

Here are some of my extractors. This is my lab Graylog server, so it has a lot more than our production one.

If you decide to go that route, I do have a lot of regex for extractor configurations I can offer.
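As a rough illustration of the "only the fields I need" idea, the following Python sketch applies one regex per wanted field to a raw line and ignores everything else. The field selection and the `extract_wanted` helper are assumptions made for the example; they are not the extractors referred to above, and in Graylog each pattern would be configured as its own regex extractor on the Raw/Plaintext input.

```python
import re

# Hypothetical allow-list: one pattern per wanted field, nothing else is extracted.
# (?<!\S) makes sure "type" does not also match inside "subtype".
WANTED = {
    "type": re.compile(r'(?<!\S)type="([^"]*)"'),
    "subtype": re.compile(r'(?<!\S)subtype="([^"]*)"'),
    "session_id": re.compile(r'(?<!\S)session_id="([^"]*)"'),
    "msg": re.compile(r'(?<!\S)msg="([^"]*)"'),
}


def extract_wanted(raw):
    """Return only the allow-listed fields found in a raw log line."""
    fields = {}
    for name, pattern in WANTED.items():
        match = pattern.search(raw)
        if match:
            fields[name] = match.group(1)
    return fields


raw = (
    'devname="smtp" type="event" subtype="smtp" '
    'session_id="1BUH2I6a027278-1BUH2I6c027278" '
    'msg="to=<support-message@domain.com>, delay=00:00:01" tz="-0100"'
)
wanted = extract_wanted(raw)
print(wanted)
```

Because the set of field names is fixed in advance, user-controlled text such as email addresses can never create new field names, which is what keeps the index below the Elasticsearch field limit.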

Ric · January 5, 2022, 1:23pm

Hi,
Thank you very much. The Raw/Plaintext UDP input with custom extractors might be a suitable solution for our needs. I gave it a try and the first results are very encouraging; now it's just a matter of tuning the regex to our specific needs!

Thanks again!

gsmith · January 5, 2022, 10:25pm

You're more than welcome. If you need some help with extractors for Fortinet logs, I would be more than glad to share those with you.

system · January 19, 2022, 10:25pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
