Now, Graylog is doing an amazing job populating the different fields with the correct information, with one exception: the value of the from field contains an = character, which results in an additional, unwanted field.
In the previous example:
expected field: from, value: 1axcflbq9qlakhml8acdliyy977alidg1uwdb9-k+bob=customdomain.com@bf07x.hubspotemail.net
unwanted field: bob, value: customdomain.com@bf07x.hubspotemail.net
Since bob is a username, every distinct sender adds a new field name. The number of fields will therefore grow rapidly, causing indexing failures as soon as the Elasticsearch field limit is reached for that index.
The questions I couldn’t answer are:
What is the best practice in a scenario like this?
Can I effectively correct the behavior of the default extractor, avoiding the double extraction in the first place?
If not, can I somehow efficiently remove those fields, whose name I cannot know in advance?
I am not sure how you sequence this with extractors (@gsmith), but I have handled something similar before in pipeline rules. I set the message to a variable, pulled out the challenging data with regex and assigned it to a field. It is a little repetitive, but you can then use regex_replace() to clear it from the variable, and finally use key_value() to pull the rest of the fields out. Here is a sample regex that would allow you to pull just the “from” data from your message:
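The sequence described above (extract the problematic field with a regex, remove it from the working copy of the message, then run a key/value parse on what is left) can be sketched in Python rather than in Graylog's pipeline rule language. This is not the poster's regex. It is only an illustration, and the sample message and every field name other than from are hypothetical.

```python
import re

# Hypothetical message: only the "from" value is taken from the thread.
MESSAGE = (
    'type=statistics client_name="mail.example.org" '
    'from="1axcflbq9qlakhml8acdliyy977alidg1uwdb9-k+bob=customdomain.com@bf07x.hubspotemail.net" '
    'to="alice@example.com" disposition=Accept'
)

# Step 1 pattern: the "from" key with either a quoted or an unquoted value.
FROM_RE = re.compile(r'(?:^|\s)from=(?:"([^"]*)"|(\S+))')

# Step 3 pattern: generic key=value or key="value" pairs.
KV_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S*))')


def parse(message):
    fields = {}
    rest = message

    # Step 1: pull the troublesome field out with a dedicated regex.
    match = FROM_RE.search(message)
    if match:
        quoted, bare = match.groups()
        fields["from"] = quoted if quoted is not None else bare
        # Step 2: clear it from the working copy (the regex_replace() step).
        rest = FROM_RE.sub("", message, count=1)

    # Step 3: key/value parse of whatever remains (the key_value() step).
    for kv in KV_RE.finditer(rest):
        key, quoted, bare = kv.groups()
        fields[key] = quoted if quoted is not None else bare
    return fields


if __name__ == "__main__":
    for name, value in parse(MESSAGE).items():
        print(name, "=", value)
```

Because the from value is removed before the generic parse runs, the = inside the address never reaches the key/value step, so no bob field is created.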
Could you describe your configurations (type of input, type of devices sending logs, extractors, etc…)?
Second question: do you want to patch this, or completely fix the issue?
The current configuration is a Syslog UDP input, receiving FortiMail-related logs from a FortiAnalyzer appliance. At the moment there are no custom extractors active on that input.
I realized that from is the most common field triggering the issue, but not the only one. Ultimately the issue is related to the way the appliance formats its messages.
Specifically, there’s a mixture of standards in the message:
field=value
field="value"
It looks like the default extractor attempts to extract values using both standards, succeeding in 99% of the cases but also causing those unwanted fields to be indexed.
For example, a whole lot of unwanted fields get indexed from the content of the msg field, e.g. 0 = jL3zoRioD8n9NQMvU3z3iCRnaaTAa8AE3bg@mail.domain.com
@gsmith I'm not sure what you mean by patch or completely fix. My main goal is to understand whether there is any best practice for this scenario, and which options, if any, might help solve the issue.
I understand now. Have you tried using a Raw/Plaintext TCP/UDP input? Perhaps that would minimize the number of fields being generated.
For example:
I have FortiGate firewalls (60E, 100, 200) in my environment. I created a Raw/Plaintext input, and on that input I created only the fields I need for alerts and notifications.
I tried Syslog UDP, but it didn’t work well in our environment.
Here are some of my extractors. This is my lab Graylog server, so it has a lot more than our production one.
Hi,
Thank you very much. The Raw/Plaintext UDP input with custom extractors might be a suitable solution for our needs. I gave it a try and the first results are very encouraging. Now it’s just a matter of tuning the regex for each field we need!
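As a rough illustration of the allow-list idea discussed above (take the raw message and extract only the fields you actually need, one regex per field), here is a minimal Python sketch. In Graylog this would be done with regex extractors on the Raw/Plaintext input. The field names, the helper names and the sample message below are hypothetical and are not taken from anyone's configuration.

```python
import re


def field_pattern(name):
    """Build a regex capturing name=value or name="value" for a single field."""
    return re.compile(r'(?:^|\s)' + re.escape(name) + r'=(?:"([^"]*)"|(\S*))')


# Only these fields are ever created; anything else in the message is ignored.
WANTED = {name: field_pattern(name) for name in ("from", "to", "subject")}


def extract_wanted(raw):
    fields = {}
    for name, pattern in WANTED.items():
        match = pattern.search(raw)
        if match:
            quoted, bare = match.groups()
            fields[name] = quoted if quoted is not None else bare
    return fields


# Hypothetical raw message mixing the field=value and field="value" styles.
RAW = (
    'log_id=0200001234 '
    'from="k+bob=customdomain.com@bf07x.hubspotemail.net" '
    'to="alice@example.com" '
    'msg="0=abc@mail.domain.com relayed"'
)

if __name__ == "__main__":
    print(extract_wanted(RAW))
```

Since field names come only from the fixed allow-list, content such as bob=... or 0=... inside a value can never become a field name, so the number of fields per index stays bounded.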