EXTRACTOR: How to extract domain name from referrer field of Apache access logs?


#1

Hi,

I have successfully set up an extractor for a Graylog input that is getting apache2 access logs.

The grok pattern I used is:

"%{DATA:clientips}" %{HTTPDUSER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:statuscode} (?:%{NUMBER:bytes}|-) %{QS:referrer} %{QS:agent}

But now, for the referrer field, I also want to extract the domain name part.
e.g. assume referrer = https://www.cnn.com/A/B/C/D

I want another field called referrer_domain_name which would be set to “https://www.cnn.com” (or just www.cnn.com without the protocol)

How would I do this when I can only set one GROK pattern in the setup page for extractors?
i.e., I’d like both the referrer and referrer_domain_name fields, with referrer_domain_name extracted from the referrer field after referrer itself has been extracted from the original log line.

Thanks for any suggestions


(Jan Doberstein) #2

You have multiple options for splitting up the referrer field.

The included URI pattern is defined as:

%{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM})?

You could build your own custom pattern that contains only the parts you want, for example:

%{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?

Give that a name like REF_DOMAIN and use that to extract the content of the field:

%{REF_DOMAIN:referrer_domain}

That would be one possible solution you can use.
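
To see what a pattern like REF_DOMAIN would keep from a referrer, here is a rough Python sketch. The regex is a simplified stand-in written for illustration, not the actual grok definitions of URIPROTO, USER and URIHOST, and the function name `extract_referrer_domain` is hypothetical.

```python
import re

# Simplified approximation of REF_DOMAIN (assumption: the real grok
# sub-patterns are stricter than these character classes).
REF_DOMAIN = re.compile(
    r"(?P<referrer_domain>"
    r"[A-Za-z][A-Za-z0-9+.-]*://"          # protocol followed by ://
    r"(?:[A-Za-z0-9._-]+(?::[^@]*)?@)?"    # optional user[:password]@
    r"(?:[A-Za-z0-9.-]+(?::\d+)?)?"        # optional host[:port]
    r")"
)


def extract_referrer_domain(referrer):
    """Return the protocol and host part of a referrer, or None if absent."""
    match = REF_DOMAIN.search(referrer)
    return match.group("referrer_domain") if match else None


if __name__ == "__main__":
    print(extract_referrer_domain("https://www.cnn.com/A/B/C/D"))
    # prints: https://www.cnn.com
```

Because the path part is left out of the pattern, everything from the first `/` after the host is dropped. A referrer of `-` yields no match at all.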


#3

Wonderful @jan

Is there a way to have both the REF_DOMAIN and REFERRER fields?
i.e., a field that is a subfield of another one

Thanks again


(Jan Doberstein) #4

That is already possible.

First run your original extractor on the message, then run the grok extractor I provided on the resulting referrer field. You will then have both fields.
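
The two-step chain can be pictured with a small Python sketch. This only illustrates the order of operations, it is not how Graylog runs extractors internally. Both regexes are simplified assumptions, the sample log line in the usage part is made up, and `run_extractors` is a hypothetical name.

```python
import re

# Step 1: simplified stand-in for the access log grok pattern. Unlike
# grok's QS, it captures referrer without the surrounding quotes.
ACCESS_LOG = re.compile(
    r'"(?P<clientips>[^"]*)" (?P<ident>\S+) (?P<auth>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<rawrequest>[^"]*)" '
    r'(?P<statuscode>\d+) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Step 2: simplified stand-in for REF_DOMAIN, applied to the referrer only.
REF_DOMAIN = re.compile(
    r"[A-Za-z][A-Za-z0-9+.-]*://(?:[^/@\s]+@)?[A-Za-z0-9.-]+(?::\d+)?"
)


def run_extractors(line):
    """Apply the first extractor to the message, then the second to referrer."""
    first = ACCESS_LOG.match(line)
    if first is None:
        return {}
    fields = first.groupdict()
    second = REF_DOMAIN.match(fields["referrer"])
    if second:
        # referrer stays as it was; the domain is added as a new field.
        fields["referrer_domain"] = second.group(0)
    return fields


if __name__ == "__main__":
    sample = ('"203.0.113.7" - - [10/Oct/2018:13:55:36 +0000] '
              '"GET /index.html HTTP/1.1" 200 2326 '
              '"https://www.cnn.com/A/B/C/D" "Mozilla/5.0"')
    result = run_extractors(sample)
    print(result["referrer"], result["referrer_domain"])
```

The second step reads the field produced by the first and writes to a new field, so the original referrer is left untouched.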


#5

Thanks @jan

That works, but is there a way to exclude these other GROK fields that were used to build up the overall pattern you designed?

i.e., I also see these fields being extracted in my messages:
HOSTNAME
IPORHOST
URIHOST
URIPARAM
URIPATH
URIPATHPARAM
URIPROTO
…etc…


(Jan Doberstein) #6

You need to tick the “extract named fields only” checkbox.


#7

Hi Jan,
When I do that, I lose the ref_domain field too.

These are my grok patterns (with your help of course):
1) REFERRER : %{REF_DOMAIN}(?:%{URIPATHPARAM})?
2) REF_DOMAIN : %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?


(Jan Doberstein) #8

Then you might have chained them together incorrectly. Do not do this all in one extractor; you want multiple, or at least two. The other option would be to name the REF_DOMAIN pattern:

REFERRER : %{REF_DOMAIN:ref_domain}(?:%{URIPATHPARAM:ref_uripath})?

and you will have extracted the domain and the URI path into two separate fields. If you also need them combined, you should use two extractors: one with the initial referrer pattern and a second with the modified one.
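
As a loose analogy for why naming the sub-patterns matters, the Python sketch below returns only the named groups, much as an extractor restricted to named fields keeps only captures that were given a name. The regex is a simplified assumption rather than the real grok definitions, and the function name `split_referrer` and the field names are illustrative.

```python
import re

# Two named groups: the domain part and the optional path part.
# The host[:port] detail inside ref_domain is deliberately simplified.
REFERRER = re.compile(
    r"(?P<ref_domain>[A-Za-z][A-Za-z0-9+.-]*://[A-Za-z0-9.-]+(?::\d+)?)"
    r"(?P<ref_uripath>/\S*)?"
)


def split_referrer(referrer):
    """Return the named parts of a referrer plus the full matched text."""
    match = REFERRER.match(referrer)
    if match is None:
        return {}
    # groupdict() holds only named groups; drop the ones that did not match.
    fields = {k: v for k, v in match.groupdict().items() if v is not None}
    fields["referrer"] = match.group(0)
    return fields


if __name__ == "__main__":
    print(split_referrer("https://www.cnn.com/A/B/C/D"))
```

A sub-pattern without a name contributes to the match but produces no field of its own, which would explain ref_domain disappearing when it is referenced as plain %{REF_DOMAIN}.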

