Efficient way to ingest lots of data and use GROK patterns?


Goal: parse Postfix log lines with a GROK pattern so that I can search Postfix-specific fields in Graylog. This should be done in a scalable way, as I will be ingesting all syslog data from 200+ hosts and will have 20+ GROK patterns to apply to log lines of different application_type values (e.g. Dovecot, Postfix, etc.).

Method I thought of:

  • Ingest all log data via a single input
  • Use extractor with GROK pattern to parse

However, this would mean the extractor would be applied to each and every input line. Sounds like there’s room for efficiency here, so I thought I’d create a separate ‘postfix’ stream (filter: application_type == "postfix"). However, an extractor is input-specific and cannot be tied to a specific stream.

What is the best way to reach my goal?

If your GROK patterns are unique or mostly unique to the applications they are parsing, have a different input per application and associate the specific GROK patterns with the application you need. If you can’t or don’t want to create separate inputs per application, consider a multi-stage pipeline to process the inputs.

Thanks for your reply. Much appreciated.

I don’t see how a pipeline would solve this problem. An extractor would still be input-specific regardless of the pipeline’s stages, right? So how would using a pipeline with multiple stages solve the problem of only being able to use an extractor on a specific input?

Creating a separate input for each facility seems a bit overkill. Especially as one syslog facility can still get messages from different applications.

Correct, extractors are run against an input, while pipelines are tied to a stream. But extractors are just that… they extract data from data based on the input and allow you to label it, type it, and extract it based on some condition. With pipelines you can do more (better?) condition checking on the data and modify the data before sending it to the stream. So if you had a syslog message that you wanted to do some basic extraction/labeling, extractors are fine. But if you wanted to modify the data or have more granular condition check, you would need pipelines.

As for a separate input for each facility: it may be overkill for you but not for others. Keep in mind that inputs are mainly tied to ports. Most modern systems let you specify the port you send syslog to, but some don’t, so if you start having multiple different systems all sending to 514, you will need to figure out a way to handle that. You may not face this issue, but it’s something to be aware of.

But that is not the case. An extractor extracts data from an input, right?

Yes… from an input… I meant from a stream of data, but realize that is a poor choice in this case… I’ll edit it for clarity

So for each log message ingested by my ‘one and only’ syslog input, I’ll have to do the following:

  • Label based on application_type
  • Extract using GROK pattern based on label

The fact that I would have to perform extraction without an extractor implies that the flow I thought of is not good practice. It also implies that it is acceptable practice to simply try every extractor for every syslog message, and if extracting using any GROK pattern fails, do nothing. But this seems to increase the risk of false positives (extracting log messages that weren’t meant to be extracted, which could be ‘failsafed’ by labeling messages beforehand).
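The "label first, then extract based on the label" flow described above can be sketched outside Graylog. The snippet below is plain Python, not Graylog's pipeline rule language, and is only meant to illustrate the idea. The two regular expressions are hypothetical, heavily simplified stand-ins for real GROK patterns, and the field names other than application_type (queue_id, recipient, status, user, remote_ip) are made up for illustration.

```python
import re

# Hypothetical, simplified patterns keyed by application_type.
# Real GROK patterns for Postfix and Dovecot are considerably longer.
PATTERNS = {
    "postfix": re.compile(
        r"^(?P<queue_id>[0-9A-F]{6,}): to=<(?P<recipient>[^>]*)>, "
        r".*status=(?P<status>\w+)"
    ),
    "dovecot": re.compile(
        r"^imap-login: Login: user=<(?P<user>[^>]*)>, "
        r".*rip=(?P<remote_ip>[\d.]+)"
    ),
}


def parse(message: dict) -> dict:
    """Return a copy of the message with extracted fields added, if any.

    Only the pattern registered for the message's application_type is
    tried, so a message is never parsed by another application's pattern.
    """
    pattern = PATTERNS.get(message.get("application_type"))
    if pattern is None:
        # No pattern for this application: leave the message untouched.
        return dict(message)
    match = pattern.search(message["message"])
    if match is None:
        # Labelled, but the line did not match: leave it untouched as well.
        return dict(message)
    return {**message, **match.groupdict()}
```

With this shape, each message costs one dictionary lookup plus at most one pattern match, instead of up to 20+ match attempts, and a line from some other application that happens to resemble a Postfix line is not extracted. In Graylog the equivalent would presumably be one pipeline rule per application whose condition checks application_type before any pattern is run.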

So still not sure what the right tool for the job is in every stage.

Can you modify the data at the source? You only need extractors if the sending and receiving systems aren’t already providing the fields for you by default. Some systems send CEF, which is easily processed by the CEF input type; others send JSON, etc.

Sometimes the source lets you customize the message, perhaps that’s something you can look at. If it can’t and traditional syslog is your only format for sending… extractors and pipelines are your only options for getting anything more than timestamp, source, level, facility and message.

Good luck.

Thanks! Got everything to work with pipelines.
