Stream rule to filter out "bots"


#1

Hello,

Please give me a hint on how to filter messages out of an nginx stream based on the field http_user_agent. I tried the following inverted rules:

  • must contain bot (also tried Bot)
  • must match regex bot (also tried Bot, /Bot/)

Messages still showing up in the stream.

Sample http_user_agent field:

Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)

Thanks!


(Prakash A) #2

@Marcell, could you please share one sample message from your logs? That will help in adding a rule to the stream. I have also attached my sample stream rule for your reference. I hope this resolves your query.

[image: stream rules]


#3

@prakasha, here you go:


(Ben van Staveren) #4

You have to adjust the stream rules that put the nginx messages into the stream: add a rule that does a regex match on /bot/, then select the ‘inverted’ flag (which basically turns the rule into “if it does not match bot”). Any messages that have ‘bot’ in the user agent field will then not go to your nginx stream, but they will still end up in ‘all messages’.
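As a rough illustration of what an inverted regex rule decides, here is a minimal Python sketch. It is not Graylog code: the function name `routed_to_stream` is hypothetical, and it assumes the rule does an unanchored, case-sensitive regex search on the field, with `(?i)` as an inline case-insensitive flag.

```python
import re

# Sample user agent taken from the first post in this thread.
SAMPLE_UA = "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"


def routed_to_stream(user_agent: str, pattern: str, inverted: bool = True) -> bool:
    """Return True if a message with this user agent satisfies the rule.

    Assumption: the rule is an unanchored regex search, and the
    'inverted' flag simply negates the result.
    """
    matched = re.search(pattern, user_agent) is not None
    return not matched if inverted else matched


# Lowercase "bot" never occurs in the sample, so the inverted rule is
# satisfied and the message would still be routed to the stream.
print(routed_to_stream(SAMPLE_UA, "bot"))

# A case-insensitive pattern matches "BLEXBot", so the inverted rule fails
# and the message would stay out of the stream.
print(routed_to_stream(SAMPLE_UA, "(?i)bot"))
```

Under these assumptions, a lowercase pattern alone would let BLEXBot through. Since ‘Bot’ was also tried in this thread without success, case sensitivity may not be the whole story.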

If you want to remove them entirely, instead of changing your stream rules, attach a pipeline to the nginx stream and write a rule sort of like this (pseudo-code-ish, look at the docs for more):

rule "forget about bots"
when
  has_field("http_user_agent") && 
  contains(value: to_string($message.http_user_agent), search: "bot", ignore_case: true)
then
  drop_message();
end

That will basically just drop all messages where the user agent contains the word ‘bot’ into the big black hole so they’ll never be stored anywhere.
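To make the condition in that rule concrete, here is a small Python sketch of the same predicate applied to a few made-up messages. The helper `should_drop` and the sample data are hypothetical; the sketch only mirrors the logic (field present, case-insensitive substring ‘bot’) and is not how Graylog evaluates pipelines internally.

```python
def should_drop(message: dict) -> bool:
    """Mirror the rule's 'when' clause: the field exists and contains 'bot'."""
    if "http_user_agent" not in message:
        # Equivalent of has_field(...) being false: leave the message alone.
        return False
    # Equivalent of contains(..., ignore_case: true).
    return "bot" in str(message["http_user_agent"]).lower()


# Made-up messages for illustration only.
messages = [
    {"http_user_agent": "Mozilla/5.0 (compatible; BLEXBot/1.0; +http://webmeup-crawler.com/)"},
    {"http_user_agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"},
    {"message": "entry without a user agent field"},
]

# Messages for which should_drop() is True would be discarded by drop_message().
kept = [m for m in messages if not should_drop(m)]
print(len(kept))
```

Note that a plain substring test like this would also drop any user agent that merely happens to contain those three letters, so a narrower pattern may be preferable in practice.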


#5

@benvanstaveren that is exactly what I did; however, the result is:


Am I missing something?


#6

As a temporary solution I allowed “leading_wildcard_searches” and added an inverse stream rule matching regex *bot*.
I’m aware that this is a resource-hungry solution, so any further advice is welcome!


(Megan) #7

Marcell, what is your order of processing under configurations? Message Filter Chain > Pipeline Processing or Pipeline Processing > Message Filter Chain?


(Ben van Staveren) #8

Heuh… okay that’s weird, I would’ve expected that rule as I described it to work but I must’ve missed something there :frowning:


#9

@megan201296

  1. Pipeline processor
  2. AWS Instance Name Lookup
  3. Message filter chain
  4. GeoIP Resolver

(Ben van Staveren) #10

Ah, then you may need to alter the order of the processing, or do the pipeline rule thing. I’m not sure which one of the two options I mentioned you used in the end :smiley:


(Megan) #11

Based on the current order, the stream rules method should work. I just wanted to check, as the order of processing sometimes trips people up. I’ll try to look closer later and identify whether there are other possible issues.