Using ElasticSearch filter plugins

Hi,

I’m very new to Graylog, Elasticsearch, the ELK stack and so on. Until a week ago I had never used Logstash, Filebeat, etc., so please forgive me if this one is obvious.

I have set up a test VM with CentOS7 in order to evaluate Graylog.

I have been importing some log files from our Exchange environment, and it’s all going great. I have a Filebeat/sidecar picking up logs from a local folder (for testing purposes), and I have several streams, rules, lookup tables, adaptors, extractors, etc. I’m almost happy with the data, except I can’t work out how to do anything with the user_agent field.

My goal is to be able to report/filter on various properties of the user_agent, such as Browser version, OS type, etc.

I have tried the API at http://www.useragentstring.com/pages/api.php using a data extractor, and that works, but the results you get back are quite limited. I’ve been scouring the net and there are a lot of references to ua-parser, which uses regexes.yaml, and this seems to be the “common” way of doing it. But I have very little coding skill and would prefer to keep this Graylog server as “out of the box” as possible.

Elasticsearch apparently has this built in (https://www.elastic.co/guide/en/logstash/current/plugins-filters-useragent.html), but for the life of me I can’t figure out how to use it.

I have tried adding it to the Filebeat config, but I get this error:

level=error msg="[filebeat] Validation command output: Exiting: error initializing publisher: error initializing processors: the processor user_agent doesn't exist\n"

My Filebeat config is as follows (obviously I got the error when the user_agent section was uncommented):

# Needed for Graylog
fields_under_root: true
fields.collector_node_id: ${sidecar.nodeName}
fields.gl2_source_collector: ${sidecar.nodeId}

filebeat.inputs:
- input_type: log
  paths:
    - /tmp/loginput/ex2016iis/*.log
  type: log
output.logstash:
   hosts: ["192.168.93.228:5000"]
path:
  data: /var/lib/graylog-sidecar/collectors/filebeat/data
  logs: /var/lib/graylog-sidecar/collectors/filebeat/log
  
processors:
    - drop_event:
        when:
            regexp:
                message: "^#"
# This doesn't work
#    - user_agent: 
#        field: "user_agent_string"

If the ability is built in, I’d much rather do that, but I’ve really only got a vague understanding of how the whole stack fits together so far.

Any help or comments greatly appreciated :slight_smile:

Sorry, using a “#” symbol made the forum format the lines as headings. Here’s the processors section without the comments:

processors:
    - drop_event:
        when:
            regexp:
                message: "^#"
    - user_agent:
        field: "user_agent_string"


https://community.graylog.org/faq#format-markdown
:wink:

Thanks :grinning:

First post edited for formatting

The filter plugin you refer to is not for Filebeat but for Logstash, so I’m afraid your options are limited at the moment. You could always create a little web service that uses ua-parser to handle things, and use that in a lookup table instead.
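For reference, that useragent filter only runs inside a Logstash pipeline, where it would be configured along these lines (a minimal sketch; the user_agent_string source field is borrowed from your posts, and the target name is just an example):

filter {
  useragent {
    # field containing the raw user agent string (assumed name)
    source => "user_agent_string"
    # nested field to write the parsed result into (example name)
    target => "user_agent"
  }
}

Since you’re shipping with Filebeat straight into a Graylog Beats input, there is no Logstash in the path to apply it.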

To make Filebeat filter out messages by regexp, you can use the exclude_lines option in the inputs section.
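Something like this, as a sketch based on the path and the "^#" pattern already in your config above:

filebeat.inputs:
- type: log
  paths:
    - /tmp/loginput/ex2016iis/*.log
  # drop IIS header lines before they ever leave the host
  exclude_lines: ['^#']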

Thanks @benvanstaveren

So there is no option to use Logstash as a sidecar instead of Filebeat?

I’ve logged a feature request here https://graylog.ideas.aha.io/ideas/GL-I-62 - If anyone else is interested in this functionality, please up-vote :smile:

Thanks @maniel Daniel,

The processors/drop_event setup I am using is working for me. Is there any benefit in using exclude_lines instead?

And a question on a similar track: before I learned about drop_event, I created a pipeline rule to drop other lines containing a specific URL (used by server monitors).

Overall, out of those three approaches, which gives the best performance? (Assuming I never want to store or report on these specific lines.)

Logstash is not a log collector, it’s a log processor. Sort of like Graylog…

The exclude_lines option makes sure the line in question is never sent to Graylog, which means you move a few CPU cycles from the Graylog server to the server running Filebeat (since it still has to check the messages).

Using a pipeline rule to drop messages, though, will also not really impact performance that much, depending on the complexity of the rule. But it does give you the advantage of having the “decision” to accept or reject something in one place.
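Such a drop rule is only a few lines; roughly like this (a sketch, with /monitoring-probe standing in for whatever URL your server monitors actually hit):

rule "drop server monitor requests"
when
    // hypothetical URL fragment - replace with the one your monitors use
    contains(to_string($message.message), "/monitoring-probe")
then
    drop_message();
end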

I’ve found it’s often easier to tell Filebeat to just send everything and figure out on the Graylog end what you do or don’t want. It’s often easier to update a pipeline rule that applies to all hosts at once than to have to reconfigure Filebeat everywhere :wink:

Many thanks Ben, and considering the scale we are looking at using this at, that’s actually very good advice. I already have separate streams for my different input types, and individual rules that can be re-used against different streams, so from a management-overhead point of view a few extra CPU cycles are probably worth it.

Not sure at what scale you’re doing things, but we’re currently running 3 Graylog nodes with some hilarious pipelines (a few of which are ‘blacklist/whitelist’ type things that drop messages), and we’re consistently pushing through about 3,000 msg/sec with CPU usage on the nodes at a comfortable 25%.

As far as that goes, though, the pipelines with blacklisting are set up in an “expensive” way. For example, our nginx access log processing pipeline (well, one of them) has a stage 0 with a set of regular rules to parse out the logs, and a stage 1 after that which discards selected messages, so we’ve already taken the “hit” on the processing; some messages are just not worth saving.

You can also turn that around and selectively accept messages: one rule sets a flag if the message is OK, then the next stage has two rules, where one checks that the flag is present and does the normal processing, and the other checks that the flag is not present and drops the message.
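As a sketch of that flag approach (the accepted_ok field name and the acceptance condition are made up for illustration):

// stage 0: mark the messages we want to keep
rule "flag accepted messages"
when
    has_field("user_agent_string")
then
    set_field("accepted_ok", true);
end

// stage 1: drop everything that was not flagged
rule "drop unflagged messages"
when
    ! has_field("accepted_ok")
then
    drop_message();
end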

But I found that purely from a management perspective it’s easier to have Graylog do it and spend a few CPU cycles. Then again, we run on bare-metal 24-core machines, so we have CPU to spare.

For anyone who may be looking to do the same thing, here is the solution I came up with. I’d love to hear people’s thoughts/opinions on how this could be done better (I’m relatively new to Linux, brand new to Graylog and ELK, and only yesterday learned how to install and use Node.js, so all tips welcome :smile: )

  1. Install node.js
  2. Install node modules ua-parser and forever
  3. Create a server.js for ua-parser
  4. Use forever to run ua-parser as a service
  5. Set up a JSON data extractor / Lookup table
  6. Use a pipeline rule to query lookup table

Install node.js & modules

# Note: We have a corporate proxy that workstations must use to get to the internet (my dev box running as a VM on my workstation)
# Remove or edit proxy information per your environment
cd /tmp
curl --proxy http://<your corp proxy server:port> -O https://nodejs.org/dist/v10.15.1/node-v10.15.1-linux-x64.tar.xz
mkdir -p /usr/local/lib/nodejs
tar -xJvf node-v10.15.1-linux-x64.tar.xz -C /usr/local/lib/nodejs
export PATH=/usr/local/lib/nodejs/node-v10.15.1-linux-x64/bin/:$PATH
npm config set https-proxy http://<your corp proxy server:port>
npm config set proxy http://<your corp proxy server:port>
cd /usr/local
npm install ua-parser
#<edit/create /usr/local/node_modules/ua-parser/server.js> - See below for info
npm install forever -g
forever start /usr/local/node_modules/ua-parser/server.js

Server.js

// Adapted from https://www.npmjs.com/package/ua-parser for use as a JSON api in Graylog, to extract user agent strings

var http = require('http');
var url = require('url');
var uaparser = require('ua-parser')
 
var server = http.createServer(function (req, res) {
	
	var parsedUrl = url.parse(req.url, true);
	
	// The function expects an un-escaped string, but most of my logs have a "+" sign instead of a space
	// Escape the string so that I can do an easy regex to replace all "+" signs, then un-escape it again for the function
	var cleanstring = escape(parsedUrl.query.query).replace(/\+/g,"%20");

	// Now run the uaparser function
	var ua = uaparser.parse(unescape(cleanstring));
	
	// Finally, turn it into a JSON query for easy consumption by Graylog and return the response
	var json_response = JSON.stringify(ua);
	res.statusCode = 200;
	res.end(json_response);
   
});
server.listen(5001, () => console.log("Server running at port 5001"));
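
Once it’s running under forever, a quick test from the shell should return the parsed JSON (the user agent string here is just an example):

# hypothetical UA string; the "+" signs stand in for spaces as in the IIS logs
curl "http://localhost:5001/?query=Mozilla/5.0+(Windows+NT+10.0)+AppleWebKit/537.36"

The response contains the string, userAgent, os and device objects that the pipeline rule below picks apart.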

Data adaptor

(screenshot)

Lookup table

(screenshot)
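(For anyone rebuilding this without the screenshots: the data adapter is Graylog’s HTTP JSONPath type with a lookup URL along the lines of http://localhost:5001/?query=${key}, pointing at the Node.js service above, and the lookup table simply ties that adapter to a cache so the pipeline rule below can call it. Treat the exact names as placeholders for your own setup.)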

Pipeline rule

rule "extract_user_agent"
when
    has_field("user_agent_string") 
then
    let ua = lookup("lookup_user_agent", to_string($message."user_agent_string"));
	
	set_field("user_agent_string", ua["string"]);
	set_field("user_agent_app_family", ua["userAgent"].family);
	set_field("user_agent_app_version_major", ua["userAgent"].major);
	set_field("user_agent_app_version_minor", ua["userAgent"].minor);
	set_field("user_agent_app_version_patch", ua["userAgent"].patch);
	
	set_field("user_agent_os_family", ua["os"].family);
	set_field("user_agent_os_version_major", ua["os"].major);
	set_field("user_agent_os_version_minor", ua["os"].minor);
	set_field("user_agent_os_version_patch", ua["os"].patch);	

	set_field("user_agent_device_family", ua["device"].family);
end
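
One note: the rule only runs once the pipeline containing it is connected to the stream(s) that carry these logs, and the lookup table name ("lookup_user_agent") has to match whatever you called yours.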

I hope that helps someone else. I’ve found these forums very helpful so far, with most of my questions already answered, so I’m happy to contribute back :smile:


Not sure if this exists already but the Graylog team should really kick up a wiki of sorts where we can put little cookbook recipes like this :smiley:

I agree, but currently the best we could do for that in this community is a sticky post that can be edited.

But for the above I created this feature issue: https://github.com/Graylog2/graylog2-server/issues/5727

Sticky posts would also work, but then we may need a “best-of” type section on the forums here where posts like this can get archived or stickied :slight_smile:

But good to see the request is in! :smiley:
