Efficient way to ingest lots of data and use GROK patterns?

Goal:

Parse Postfix log lines with a GROK pattern so that Postfix-specific fields can be searched in Graylog. This should be done in a scalable way, as I will be ingesting all syslog data from 200+ hosts and will have 20+ Grok patterns to apply to different application_type log lines (e.g. Dovecot, Postfix, etc.).

Method I thought of:

  • Ingest all log data via a single input
  • Use extractor with GROK pattern to parse

However, this would mean the extractor would run against each and every incoming log line. That sounds like there’s room for efficiency here, so I thought I’d create a separate ‘postfix’ stream (filter: application_type == "postfix"). But an extractor is input-specific and cannot be tied to a specific stream.

What is the best way to reach my goal?

If your GROK patterns are unique or mostly unique to the applications they are parsing, create a different input per application and associate the specific GROK patterns with the input they apply to. If you can’t or don’t want to create separate inputs per application, consider a multi-stage pipeline to process the inputs.
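For example, a minimal sketch of such a pipeline, connected to a stream such as "All messages" (the pipeline name, rule names and stage layout are placeholders): stage 0 tags each message with its application type, and stage 1 only runs the extraction rules whose condition matches that tag.

```
pipeline "Syslog processing"
stage 0 match either
  rule "label postfix messages";
stage 1 match either
  rule "extract postfix fields";
  rule "extract dovecot fields";
end
```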

Thanks for your reply. Much appreciated.

I don’t see how a pipeline would solve this problem. An extractor would still be input-specific regardless of the pipeline’s stages, right? So how would using a pipeline with multiple stages solve the problem of only being able to use an extractor on a specific input?

Creating a separate input for each facility seems a bit overkill, especially as one syslog facility can still receive messages from different applications.

Correct, extractors run against an input, while pipelines are tied to a stream. But extractors are just that… they extract data from the messages arriving on an input and let you label it, type it, and extract it based on some condition. With pipelines you can do more (better?) condition checking on the data and modify the data before sending it to the stream. So if you have a syslog message on which you only want to do some basic extraction/labeling, extractors are fine. But if you want to modify the data or need more granular condition checks, you need pipelines.
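As a rough illustration (the rule name and behaviour are just an example of mine), a pipeline rule can both gate on a condition and rewrite the message, which an extractor can’t really do:

```
rule "normalize source hostname"
when
  has_field("source")
then
  // modify the message before it moves on: force the source field to lowercase
  set_field(field: "source", value: lowercase(to_string($message.source)));
end
```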

As for the separate input for each facility: it may be overkill for you, but it may not be for others. Keep in mind that inputs are mainly tied to ports, and while most modern systems let you specify the port you send syslog to, some don’t, so if you start having multiple different systems all sending to 514, you will need to figure out a way to handle that. You may not face this issue, but it’s something to be aware of.

But that is not the case. An extractor extracts data from an input, right?

Yes… from an input. I meant “from a stream of data”, but I realize that’s a poor choice of words in this case… I’ll edit it for clarity.

So for each log message ingested by my ‘one and only’ syslog input, I’ll have to do the following:

  • Label based on application_type
  • Extract using GROK pattern based on label

The fact that I would have to perform extraction without an extractor suggests that the flow I had in mind is not good practice. It also implies that it is acceptable practice to simply try every extractor on every syslog message and, if extraction with any GROK pattern fails, do nothing. But that seems to increase the risk of false positives (extracting from log messages that weren’t meant to be extracted, which labeling messages beforehand would guard against).
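If I sketch that flow as pipeline rules (the rule and field names are just my guesses, application_name is whatever field the syslog input populates with the program name, and %{POSTFIX_SMTPD} stands in for a custom pattern I would define under System → Grok Patterns), I imagine something like this:

```
// Stage 0: label the message based on the application that produced it
rule "label postfix messages"
when
  has_field("application_name") && starts_with(to_string($message.application_name), "postfix")
then
  set_field(field: "application_type", value: "postfix");
end

// Stage 1: run the Postfix Grok pattern only on messages labelled in stage 0
rule "extract postfix fields"
when
  has_field("application_type") && to_string($message.application_type) == "postfix"
then
  let parsed = grok(pattern: "%{POSTFIX_SMTPD}", value: to_string($message.message), only_named_captures: true);
  set_fields(parsed);
end
```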

So I’m still not sure what the right tool for the job is at each stage.

Can you modify the data at the source? You only need extractors if the sending and receiving systems aren’t already providing the fields for you by default. Some systems send CEF, which is easily processed by the CEF input type… others send JSON, etc…
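For instance, if a sender can be switched to putting JSON in the message body, a short pipeline rule can expand it into fields (just a sketch; the guard and the resulting field names depend entirely on what the sender emits):

```
rule "expand json message body"
when
  // cheap guard: only try to parse message bodies that look like JSON
  starts_with(to_string($message.message), "{")
then
  let json = parse_json(to_string($message.message));
  set_fields(to_map(json));
end
```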

Sometimes the source lets you customize the message; perhaps that’s something you can look at. If it can’t, and traditional syslog is your only format for sending, then extractors and pipelines are your only options for getting anything more than timestamp, source, level, facility and message.

Good luck.

Thanks! Got everything to work with pipelines.
