Using ElasticSearch filter plugins

Hi,

I’m very new to Graylog, Elasticsearch, the ELK stack and so on. Until a week ago I had never used Logstash, Filebeat, etc., so please forgive me if this one is obvious.

I have set up a test VM with CentOS7 in order to evaluate Graylog.

I have been importing some log files from our Exchange environment, and it’s all going great. I have a Filebeat/sidecar picking up logs from a local folder (for testing purposes), and I have several streams, rules, lookup tables, adaptors, extractors, etc. I’m almost happy with the data, except I can’t work out how to do anything with the user_agent field.

My goal is to be able to report/filter on various properties of the user_agent, such as Browser version, OS type, etc.

I have tried the API at http://www.useragentstring.com/pages/api.php using a data extractor, and that works, but the results you get back are quite limited. I’ve been scouring the net and there are a lot of references to ua-parser, which uses regexes.yaml, and this seems to be the “common” way of doing it. But I have very little coding skill and would prefer to keep this Graylog server as “out of the box” as possible.

Elasticsearch apparently has this built in (https://www.elastic.co/guide/en/logstash/current/plugins-filters-useragent.html), but for the life of me I can’t figure out how to use it.

I have tried adding it to the Filebeat config, but I get this error:

level=error msg="[filebeat] Validation command output: Exiting: error initializing publisher: error initializing processors: the processor user_agent doesn't exist\n"

My Filebeat config is as follows (obviously I got the error when the user_agent section was uncommented):

# Needed for Graylog
fields_under_root: true
fields.collector_node_id: ${sidecar.nodeName}
fields.gl2_source_collector: ${sidecar.nodeId}

filebeat.inputs:
- input_type: log
  paths:
    - /tmp/loginput/ex2016iis/*.log
  type: log
output.logstash:
   hosts: ["192.168.93.228:5000"]
path:
  data: /var/lib/graylog-sidecar/collectors/filebeat/data
  logs: /var/lib/graylog-sidecar/collectors/filebeat/log
  
processors:
    - drop_event:
        when:
            regexp:
                message: "^#"
# This doesn't work
#    - user_agent: 
#        field: "user_agent_string"

If the ability is built in, I’d much rather do that, but I’ve really only got a vague understanding of how the whole stack fits together so far.

Any help or comments greatly appreciated :slight_smile:

Sorry, using a “#” symbol made the forum format the lines as headings. Here’s the processors section without the comments:

processors:
    - drop_event:
        when:
            regexp:
                message: "^#"
    - user_agent:
        field: "user_agent_string"


https://community.graylog.org/faq#format-markdown
:wink:

Thanks :grinning:

First post edited for formatting

The filter plugin you refer to is not for Filebeat but for Logstash, so I’m afraid your options are limited at the moment. You could always create a little web service that uses ua-parser to handle things, and use that in a lookup table instead.
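For reference, that useragent filter only runs inside a Logstash pipeline, where it would be configured along these lines (a minimal sketch; the user_agent_string source field is borrowed from your posts, and the target name is just an example):

filter {
  useragent {
    # field containing the raw user agent string (assumed name)
    source => "user_agent_string"
    # nested field to write the parsed result into (example name)
    target => "user_agent"
  }
}

Since you’re shipping with Filebeat straight into a Graylog Beats input, there is no Logstash in the path to apply it.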

To make Filebeat filter out messages by regexp, you can use the exclude_lines option in the inputs section.
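Something like this, as a sketch based on the path and the "^#" pattern already in your config above:

filebeat.inputs:
- type: log
  paths:
    - /tmp/loginput/ex2016iis/*.log
  # drop IIS header lines before they ever leave the host
  exclude_lines: ['^#']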

Thanks @benvanstaveren

So there is no option to use Logstash as a sidecar instead of Filebeat?

I’ve logged a feature request here https://graylog.ideas.aha.io/ideas/GL-I-62 - If anyone else is interested in this functionality, please up-vote :smile:

Thanks @maniel Daniel,

The processors/drop_event setup I am using is working for me. Is there any benefit in using exclude_lines instead?

And a question on a similar track: before I learned about drop_event, I created a pipeline rule to drop other lines containing a specific URL (used by server monitors).

Overall, out of those three approaches, which gives the best performance? (Assuming I never want to store or report on these specific lines.)

Logstash is not a log collector, it’s a log processor. Sort of like Graylog…

The exclude_lines option makes sure the line in question is never sent to Graylog, which means you move a few CPU cycles from the Graylog server to the server running Filebeat (since it still has to check the messages).

Using a pipeline rule to drop messages, though, will also not really impact performance that much, depending on the complexity of the rule. But it does give you the advantage of having the “decision” to accept or reject something in one place.
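Such a drop rule is only a few lines; roughly like this (a sketch, with /monitoring-probe standing in for whatever URL your server monitors actually hit):

rule "drop server monitor requests"
when
    // hypothetical URL fragment - replace with the one your monitors use
    contains(to_string($message.message), "/monitoring-probe")
then
    drop_message();
end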

I’ve found it’s often easier to tell Filebeat to just send everything and figure out on the Graylog end what you do or don’t want. It’s often easier to update a pipeline rule that applies to all hosts at once than to have to reconfigure Filebeat everywhere :wink:

Many thanks Ben, and considering the scale we are looking at using this at, that’s actually very good advice. I already have separate streams for my different input types, and individual rules that can be re-used against different streams, so from a management-overhead point of view a few extra CPU cycles are probably worth it.

Not sure at what scale you’re doing things, but we’re currently running 3 Graylog nodes with some hilarious pipelines (a few of which are ‘blacklist/whitelist’ type things that drop messages), and we’re consistently pushing through about 3,000 msg/sec with CPU usage on the nodes at a comfortable 25%.

As far as that goes, though, the pipelines with blacklisting are set up in an “expensive” way. For example, our nginx access log processing pipeline (well, one of them) has a stage 0 with a set of regular rules to parse out the logs, and a stage 1 after that which discards selected messages, so we’ve already taken the “hit” on the processing; some messages are just not worth saving.

You can also turn that around and selectively accept messages: one rule sets a flag if the message is OK, then the next stage has two rules, where one checks that the flag is present and does the normal processing, and the other checks that the flag is not present and drops the message.
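As a sketch of that flag approach (the accepted_ok field name and the acceptance condition are made up for illustration):

// stage 0: mark the messages we want to keep
rule "flag accepted messages"
when
    has_field("user_agent_string")
then
    set_field("accepted_ok", true);
end

// stage 1: drop everything that was not flagged
rule "drop unflagged messages"
when
    ! has_field("accepted_ok")
then
    drop_message();
end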

But I found that purely from a management perspective it’s easier to have Graylog do it and spend a few CPU cycles. Then again, we run on bare-metal 24-core machines, so we have CPU to spare.

For anyone who may be looking to do the same thing, here is the solution I came up with. I’d love to hear people’s thoughts/opinions on how this could be done better (I’m relatively new to Linux, brand new to Graylog and ELK, and only yesterday learned how to install and use Node.js, so all tips welcome :smile: )

  1. Install node.js
  2. Install node modules ua-parser and forever
  3. Create a server.js for ua-parser
  4. Use forever to run ua-parser as a service
  5. Set up a JSON data extractor / Lookup table
  6. Use a pipeline rule to query lookup table

Install node.js & modules

# Note: We have a corporate proxy that workstations must use to get to the internet (my dev box running as a VM on my workstation)
# Remove or edit proxy information per your environment
cd /tmp
curl --proxy http://<your corp proxy server:port> -O https://nodejs.org/dist/v10.15.1/node-v10.15.1-linux-x64.tar.xz
mkdir -p /usr/local/lib/nodejs
tar -xJvf node-v10.15.1-linux-x64.tar.xz -C /usr/local/lib/nodejs
export PATH=/usr/local/lib/nodejs/node-v10.15.1-linux-x64/bin/:$PATH
npm config set https-proxy http://<your corp proxy server:port>
npm config set proxy http://<your corp proxy server:port>
cd /usr/local
npm install ua-parser
#<edit/create /usr/local/node_modules/ua-parser/server.js> - See below for info
npm install forever -g
forever start /usr/local/node_modules/ua-parser/server.js

Server.js

// Adapted from https://www.npmjs.com/package/ua-parser for use as a JSON api in Graylog, to extract user agent strings

var http = require('http');
var url = require('url');
var uaparser = require('ua-parser')
 
var server = http.createServer(function (req, res) {
	
	var parsedUrl = url.parse(req.url, true);
	
	// The function expects an un-escaped string, but most of my logs have a "+" sign instead of a space
	// Escape the string so that I can do an easy regex to replace all "+" signs, then un-escape it again for the function
	var cleanstring = escape(parsedUrl.query.query).replace(/\+/g,"%20");

	// Now run the uaparser function
	var ua = uaparser.parse(unescape(cleanstring));
	
	// Finally, turn it into a JSON query for easy consumption by Graylog and return the response
	var json_response = JSON.stringify(ua);
	res.statusCode = 200;
	res.end(json_response);
   
});
server.listen(5001, () => console.log("Server running at port 5001"));
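
Once it’s running under forever, a quick test from the shell should return the parsed JSON (the user agent string here is just an example):

# hypothetical UA string; the "+" signs stand in for spaces as in the IIS logs
curl "http://localhost:5001/?query=Mozilla/5.0+(Windows+NT+10.0)+AppleWebKit/537.36"

The response contains the string, userAgent, os and device objects that the pipeline rule below picks apart.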

Data adaptor

(screenshot)

Lookup table

(screenshot)
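(For anyone rebuilding this without the screenshots: the data adapter is Graylog’s HTTP JSONPath type with a lookup URL along the lines of http://localhost:5001/?query=${key}, pointing at the Node.js service above, and the lookup table simply ties that adapter to a cache so the pipeline rule below can call it. Treat the exact names as placeholders for your own setup.)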

Pipeline rule

rule "extract_user_agent"
when
    has_field("user_agent_string") 
then
    let ua = lookup("lookup_user_agent", to_string($message."user_agent_string"));
	
	set_field("user_agent_string", ua["string"]);
	set_field("user_agent_app_family", ua["userAgent"].family);
	set_field("user_agent_app_version_major", ua["userAgent"].major);
	set_field("user_agent_app_version_minor", ua["userAgent"].minor);
	set_field("user_agent_app_version_patch", ua["userAgent"].patch);
	
	set_field("user_agent_os_family", ua["os"].family);
	set_field("user_agent_os_version_major", ua["os"].major);
	set_field("user_agent_os_version_minor", ua["os"].minor);
	set_field("user_agent_os_version_patch", ua["os"].patch);	

	set_field("user_agent_device_family", ua["device"].family);
end
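
One note: the rule only runs once the pipeline containing it is connected to the stream(s) that carry these logs, and the lookup table name ("lookup_user_agent") has to match whatever you called yours.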

I hope that helps someone else. I’ve found these forums very helpful so far, with most of my questions already answered, so I’m happy to contribute back :smile:


Not sure if this exists already but the Graylog team should really kick up a wiki of sorts where we can put little cookbook recipes like this :smiley:

I agree, but currently the best we could do for that in this community is a sticky post that can be edited.

But for the above I created this feature issue: https://github.com/Graylog2/graylog2-server/issues/5727

Sticky posts would also work, but then we may need a “best-of” type section on the forums here where posts like this can get archived or stickied :slight_smile:

But good to see the request is in! :smiley:
