Can we use Elasticsearch false or strict with Graylog?

(Chris Thompson) #1

My Graylog index has had a mapping explosion: there are 4-5k fields in the index. This is preventing me from upgrading to Elasticsearch 5.6 (from 2.4.5) and thus to Graylog 3.0 (from Graylog 2.5).
I loaded the ‘Elasticsearch Migration Helper v2.0.4’, which says I have too many fields to upgrade; the default limit is 1000.
I hesitate to raise the limit to 6000, and doing so would not solve the underlying problem of unwanted fields being created dynamically.
So I’m wondering: how do I use Elasticsearch’s false or strict dynamic mapping modes with Graylog?
https://www.elastic.co/guide/en/elasticsearch/guide/2.x/dynamic-mapping.html

(Jan Doberstein) #2

Graylog would not work if you disabled dynamic mapping.

If you have too many fields, split the data into different indices, for example Windows and Linux logs into separate index sets. Or use the power of processing pipelines to delete unwanted/unneeded fields before they are ingested into Elasticsearch.
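
For illustration, a minimal pipeline rule that deletes known unwanted fields before indexing could look like this (the field names here are placeholders, not from this thread):

```
rule "drop unwanted fields"
when
  has_field("unwanted_field")
then
  // remove fields you never want Elasticsearch to map
  // (field names are examples only)
  remove_field("unwanted_field");
  remove_field("another_unwanted_field");
end
```

This only works when the unwanted field names are known in advance, which is the limitation discussed below.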

(Chris Thompson) #3

Ok, that makes sense; I will read up on pipelines and try that. With separate indices, I understand that each type of log can create up to 1000 fields per index in ES >= 5.6.

The issue is that our website logs have complicated HTTP headers which create thousands of nonsense fields like _X, xb, zg, _v, etc., so there’s no way to predict which fields I would want to remove with the pipeline; they are constantly changing. Maybe there’s a way to do that with the pipeline… gotta read the docs.

Seems like creating a custom template will not help with this issue, right? That’s for mapping detected fields to a type, not for limiting which fields can be created?
Thanks!

(Jan Doberstein) #4

The issue is that our website logs have complicated HTTP headers which create thousands of nonsense fields like _X, xb, zg, _v, etc., so there’s no way to predict which fields I would want to remove with the pipeline; they are constantly changing. Maybe there’s a way to do that with the pipeline… gotta read the docs.

So why not extract the information you want to keep, create a new message out of that, and drop the original message?
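
A sketch of that approach as a pipeline rule; the extracted field (client_ip) is a hypothetical example, and this assumes set_field’s optional message parameter to write into the new message:

```
rule "rebuild message with selected fields"
when
  has_field("message")
then
  // create a fresh message carrying only what we want to keep
  let new_msg = create_message(to_string($message.message), to_string($message.source));
  // copy over selected fields (client_ip is a placeholder example)
  set_field(field: "client_ip", value: to_string($message.client_ip), message: new_msg);
  // discard the original, field-heavy message
  drop_message();
end
```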

(Chris Thompson) #5

Thanks so much for your advice about using streams to send each log source to its own Elasticsearch index. That is a far better way to set this up and is helping me at least isolate the problem. Graylog is deep; I have a lot to learn XD.

To answer your question:
I am using the ‘replace with regex’ extractor to replace commas with spaces, along with the key=value converter, to create the fields for our WAF logs. (We have other sources which use normal extractors.) We did this for convenience, and because it performed far better than using Grok/regex to weed through these enormous log entries and extract 20-30 fields (which is how we started out).

Our WAF handles a metric ton of attacks, and I think the key=value converter is sometimes taking arbitrary junk variables the attackers throw at us and turning them into fields. In other cases it seems to interpret anything around an ‘=’ sign as a key=value pair and extract it, even if there is no space delimiter.

Given what’s happening, do I have a choice aside from building a detailed extractor for just the fields we want? I can’t throw the rest of the message away because we need it for forensics.

(Jan Doberstein) #6

In other cases it seems to interpret anything around an ‘=’ sign as a key=value pair and extract it, even if there is no space delimiter.

The extractor does exactly that, so the key=value extractor is not that powerful. The processing pipeline’s key_value() function can be configured in more detail:

key_value(value: to_string($message.message), trim_value_chars: "\"", trim_key_chars: "\"", delimiters: " ", kv_delimiters: "=");

This would prevent an ‘=’ inside a value from being treated as the start of a new key=value pair.

You might get better results using the processing pipelines.

(Chris Thompson) #7

I did get the messages from our various devices divided into separate indices by adding static fields to the inputs and using stream rules to match those and direct the data to the appropriate index.
When I was first setting Graylog up, there were just too many moving parts for me to absorb using streams and multiple indices as well. Now that I have, I can see that it’s a much better configuration.

I have not used the processing pipelines before but I will try it now, thnx!

(Chris Thompson) #8

I built a pipeline based on your suggestion that is working so far, and that I think is specific enough NOT to parse the contents of cookies/query strings/etc. into fields:

rule "http request extraction"
when
  has_field("message")
then
  let rawmsg = to_string($message.message);
  let pairs = key_value(value: rawmsg, trim_value_chars: "\"", trim_key_chars: "\"", delimiters: ",", kv_delimiters: "=");
  set_fields(pairs);
end

That is breaking all the key1="value1",key2="value2",… pairs out into mappings nicely. I used has_field("message") as the condition because I want the rule to act on every message in the stream.

Seems like it might be a good idea to have a second stage that sets the Elasticsearch datatype for each field.
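
A hedged sketch of what that second stage could look like, assuming a numeric field named bytes_sent exists (the field name is a placeholder); converting the value before it is indexed lets Elasticsearch detect it as a number rather than a string:

```
rule "coerce field types"
when
  has_field("bytes_sent")
then
  // re-set the field as a long so Elasticsearch maps it numerically
  // (bytes_sent is a hypothetical example field)
  set_field("bytes_sent", to_long($message.bytes_sent));
end
```

The other option for pinning datatypes would be a custom index mapping/template on the Elasticsearch side.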
If you have any suggestions for refining this, please comment!