Fielddata automatically enabled for "source" field

Hi folks,

I recently discovered that Graylog enables fielddata for the “source” field by default in the Elasticsearch mapping. I was looking at the memory usage of Elasticsearch segments and noticed a relatively high percentage of memory usage per segment caused by the “source” field.

So I tried to find out why Graylog uses fielddata for the “source” field, but couldn’t find anything. The Graylog documentation also implies that it’s not being used:

"source" : {
    "analyzer" : "analyzer_keyword",
    "index" : "analyzed",
    "type" : "string"
    },

But our index template in Elasticsearch used by Graylog is using it:

"source": {
    "type": "text",
    "analyzer": "analyzer_keyword",
    "fielddata": true
    },

If I read the Graylog source code correctly, this means fielddata is enabled for the “source” field:

.put("source", analyzedString("analyzer_keyword", true))
method def analyzedString(String analyzer, boolean fieldData)

Can anyone explain to me why this is being used? And if it’s really needed, why don’t we use doc_values instead of fielddata, as Elasticsearch advises?

Thanks in advance. Tim.

Hi Tim - Could you please share the link where Elasticsearch advises using doc_values instead of fielddata?

Hi Priya,

I tried to in my last post, but as a newbie I can only include two URLs per post.

https://www.elastic.co/guide/en/elasticsearch/reference/current/fielddata.html

“Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.”…
…“Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:”

{
  "mappings": {
    "properties": {
      "my_field": { 
        "type": "text",
        "fields": {
          "keyword": { 
            "type": "keyword"
          }
        }
      }
    }
  }
}

So with the Graylog mapping in mind, it would look something like this:

"source": {
    "type": "text",
    "analyzer": "analyzer_keyword",
    "fields": {
          "keyword": { 
            "type": "keyword"
          }
       }
    },

doc_values are enabled for keyword fields by default and are stored on disk; if used frequently, they are cached in memory.

fielddata is the doc_values equivalent for “text” fields, but it is built on demand and held entirely in heap memory, which is a relatively expensive process.
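The memory impact is easy to check, by the way. The cat API shows how much heap the “source” fielddata is using on each node (run it against your own cluster):

```
GET /_cat/fielddata?v&fields=source
```

If the field has ever been aggregated or sorted on, you should see a non-zero size per node that never shrinks until the node restarts.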

But again, I’m not really sure why Graylog is using it, so maybe there’s a logical reason for using fielddata.

In the Graylog source code there is a comment above the part where fielddata is enabled (link in first post), which may be part of the reason:
// to support wildcard searches in source we need to lowercase the content (wildcard search lowercases search term)

But the comment is talking about searching; that’s probably why the analyzer_keyword is there. Both doc_values and fielddata exist for things like aggregations and sorting, not for searching.
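To illustrate what that comment is getting at (a simplified sketch, not Graylog’s actual code): if the search term is lowercased at query time, the indexed value must also be lowercased at index time, otherwise mixed-case sources would never match a wildcard query. The hostnames below are made up for the example:

```python
import fnmatch

# Hypothetical source values as they arrive in log messages.
sources = ["WebServer01", "webserver02", "MailGW"]

def wildcard_search(term, values):
    # Wildcard search lowercases the search term (as the Graylog comment notes)...
    lowered_term = term.lower()
    # ...so the indexed value must be lowercased too (which is what the
    # lowercase filter in analyzer_keyword does); otherwise "Web*" would
    # never match "WebServer01".
    return [v for v in values if fnmatch.fnmatchcase(v.lower(), lowered_term)]

print(wildcard_search("Web*", sources))  # matches both web servers, case-insensitively
```

So the lowercasing explains the analyzer choice, but it doesn’t explain the fielddata flag.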

Tim - I’m a newbie to Elasticsearch and I’m reading the documents along with you. I’m wondering if we can ask some simple questions and navigate through the problem you are facing. I appreciate you taking the time to write to us.
Problem statement: high memory usage per segment caused by the “source” field, since the “fielddata” attribute is set to “true”.

Solution 1: here is a link on how to limit memory usage; it requires tweaking your elasticsearch.yml file.
https://www.elastic.co/guide/en/elasticsearch/guide/master/_limiting_memory_usage.html

Hi Priya,

Thank you for all the research you’re doing. Just to clarify my point: we don’t have memory issues at the moment. I just noticed that Graylog is using fielddata where it could use doc_values, which are probably more efficient.

So the only thing I was really interested in: Why did the Graylog developers choose to use fielddata? And why only on the source field?

I noticed it because I was making other improvements to our logging setup and was looking into different aspects of Elasticsearch memory usage.

Reading the URL you sent, I notice a lot of safety measures and workarounds for using fielddata safely. Elastic keeps noting that it’s not the best or most efficient way to accomplish this. If you ever aggregate (e.g. use Quick Values in Graylog), Elasticsearch will load the necessary structures (the un-inverted index) into heap memory, where they will stay until the node is restarted or dies. So hitting the limits you’re referring to in your URL will probably only cause functionality to stop working, while protecting Elasticsearch from out-of-memory issues.

That’s why they advise adding an extra keyword field via the Elasticsearch mapping: these fields automatically give you the functionality fielddata provides, but without using heap memory, because they are stored on disk and only cached when used (frequently).
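With that extra keyword subfield in place, an aggregation (which is roughly what Quick Values does under the hood) would target the subfield and be served from doc_values instead of fielddata. A hypothetical request, assuming a Graylog index named graylog_0 and the mapping sketched above:

```
POST /graylog_0/_search
{
  "size": 0,
  "aggs": {
    "top_sources": {
      "terms": { "field": "source.keyword" }
    }
  }
}
```

The full-text “source” field would remain available for searching, so nothing visible would change for queries.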

But again, I don’t know why Graylog uses fielddata yet, so maybe it’s really there for a reason.

Hi Tim - It’s good that you are not having memory issues at the moment.
Is the index template you are referring to a custom template?

No, just the default mapping generated automatically by Graylog.