Fielddata automatically enabled for "source" field

Hi folks,

I found out by accident that Graylog enables fielddata for the “source” field by default in the Elasticsearch mapping. While looking at the memory usage of Elasticsearch segments, I noticed a relatively high percentage of memory usage per segment caused by the “source” field.

So I tried to find out why Graylog uses fielddata for the “source” field, but couldn’t find anything. The Graylog documentation even implies that it’s not being used:

"source" : {
    "analyzer" : "analyzer_keyword",
    "index" : "analyzed",
    "type" : "string"
    },

But our index template in Elasticsearch used by Graylog is using it:

"source": {
    "type": "text",
    "analyzer": "analyzer_keyword",
    "fielddata": true
    },

If I read the Graylog source code correctly, I think this means fielddata is being used for the “source” field:

.put("source", analyzedString("analyzer_keyword", true))
with the method defined as: analyzedString(String analyzer, boolean fieldData)

Can anyone explain to me why this is being used? And if it’s really needed, why don’t we use doc_values instead of fielddata, as Elasticsearch advises?

Thanks in advance. Tim.

Hi Tim - Could you please share the link where Elasticsearch advises using doc_values instead of fielddata?

Hi Priya,

I tried to in my last post, but as a newbie I can only use two URLs per post.

“Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.”…
…“Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:”

{
  "mappings": {
    "properties": {
      "my_field": { 
        "type": "text",
        "fields": {
          "keyword": { 
            "type": "keyword"
          }
        }
      }
    }
  }
}

So with the Graylog mapping in mind, it would look something like this:

"source": {
    "type": "text",
    "analyzer": "analyzer_keyword",
    "fields": {
          "keyword": { 
            "type": "keyword"
          }
       }
    },
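
With a mapping like that, aggregations and sorting would simply target the keyword sub-field instead of the analyzed text field. As a sketch (the index name is just an example, and “source.keyword” assumes the sub-field name from the mapping above):

GET graylog_generic_18/_search
{
  "size": 0,
  "aggs": {
    "top_sources": {
      "terms": { "field": "source.keyword" }
    }
  }
}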

doc_values are enabled by default for keyword fields and are stored on disk. If used frequently, they are cached in memory.

fielddata is the doc_values equivalent for “text” fields and is kept entirely in memory. Building it is a relatively expensive process.
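
To make that concrete: aggregating on source (e.g. via the quick values in Graylog) means running a terms aggregation on the analyzed text field itself, which Elasticsearch only allows because fielddata is set to true; on a normal text field it refuses with an error along the lines of “Fielddata is disabled on text fields by default”. A sketch of such an aggregation (index name is again just an example):

GET graylog_generic_18/_search
{
  "size": 0,
  "aggs": {
    "sources": {
      "terms": { "field": "source" }
    }
  }
}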

But again, I’m not really sure why Graylog is using it so maybe there’s a logical reason for using fielddata.

In the Graylog source code, someone wrote a note above the part where fielddata is used (link in first post), which may be part of the reason:
// to support wildcard searches in source we need to lowercase the content (wildcard search lowercases search term)

But the note is talking about searching. That’s probably why the analyzer_keyword is there. Both doc_values and fielddata are there for things like aggregations and sorting, not searching.
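
For illustration, the kind of wildcard search that note is about would look something like this (index pattern and search term made up); since the wildcard search lowercases the search term, as the note says, the indexed value has to be lowercased as well for it to match:

GET graylog_*/_search
{
  "query": {
    "wildcard": { "source": "webserver-*" }
  }
}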

Tim - I’m a newbie to Elasticsearch. I’m reading the documents along with you, and wondering if we can ask some simple questions and navigate through the problem you are facing. I appreciate you taking the time to write to us.
Problem statement: high memory usage per segment caused by the “source” field, since the “fielddata” attribute is set to “true”.

Solution 1: posting a link below on how to limit memory usage; it requires tweaking your elasticsearch.yml file.
https://www.elastic.co/guide/en/elasticsearch/guide/master/_limiting_memory_usage.html
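
For example, that guide describes capping the fielddata cache in elasticsearch.yml (the value below is just an illustration, not a recommendation):

indices.fielddata.cache.size: 20%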

Hi Priya,

Thank you for all the research you’re already doing. Just to clarify my point: we don’t have memory issues at the moment. I just noticed that Graylog is using fielddata where it could use doc_values, which are probably more efficient.

So the only thing I was really interested in: Why did the Graylog developers choose to use fielddata? And why only on the source field?

I noticed because I was doing other improvements on our logging setup and was looking into different aspects of Elasticsearch memory usage.

Reading the URL you’ve sent, I’m noticing a lot of safety measures and workarounds for using fielddata safely. Elastic keeps noting that it’s not the best or most efficient way to accomplish things. If you ever aggregate (e.g. use the quick values in Graylog), Elasticsearch loads the necessary structures (the un-inverted index) into memory, where they stay until the node is restarted or dies. So hitting the limits you’re referring to in your URL will probably just cause functionality to stop working, while keeping Elasticsearch from running into out-of-memory issues.

That’s why they advise adding an extra keyword field via the Elasticsearch mapping: these fields automatically give you the functionality fielddata gives you, but without using memory, because they are stored on disk and only cached when you use them (frequently).

But again. I don’t know the reason why Graylog uses fielddata yet, so maybe it’s really there for a reason.

Hi Tim - It’s awesome that you are not having memory issues at the moment.
Is the index template you are referring to a custom template?

No, just the default mapping generated automatically by Graylog.

Tim - Just so we are on the same page: your Elasticsearch link points to version 7.6, while the latest Elasticsearch version supported by Graylog is 6.8.x. On the Elasticsearch site, “current” refers to 7.6.
Also, would you like to share a picture of what the cluster and index set configuration of your environment looks like?
Are you on version 3.1 or 3.2 of Graylog?


Hi Priya, I’ll reply to your comment, but it doesn’t address my question. If you don’t understand my question, please feel free to say so; then I can try to explain it in a different way.

To answer your comment: on the right side of the Elasticsearch documentation page you can change the Elasticsearch version it applies to, under “Elasticsearch Reference”. If you do so, you’ll see it doesn’t really differ between the master version and the version Graylog is using. As for the second question in your comment: it doesn’t matter which version I’m running, because it’s something I noticed Graylog doing in general, not in a specific version. Again, we don’t have any problems at the moment; I just had a fair question (my first post).

So could you please read my first post again and tell me whether you understand the question I’m asking there? I can explain it differently if needed. I refer a few times to the Graylog documentation and source code, and my question is about why certain decisions were made and why there’s a difference between the documentation and the source code.

Tim - I want to pay attention to your problem, which does not exist at this moment. You have gone pretty deep in analyzing many aspects of the Graylog ecosystem (source code, documentation, etc.).

How did your problem of a high percentage of memory usage per segment caused by the “source” field go away?

I do not know where to look for the “memory usage of Elasticsearch segments”. Please advise. A screenshot would be helpful.

you do have a very fair question. :grinning:


Hi Priya!

The “problem” didn’t go away. And for now it’s not really a problem, because Graylog is only doing it on one specific field, the source field. But it’s probably doing it for everyone using Graylog at the moment. And it’s probably not a problem for those users either, because Graylog is only doing it for one field. If you google memory_size_in_bytes elasticsearch, you’ll immediately see questions about fielddata at the top, so it can become a problem…

Here are some REST calls to see how much memory fielddata is using in your Elasticsearch cluster:

GET graylog_generic_18/_stats/fielddata?fields=* (This will give all fielddata memory_size_in_bytes for a specific index)
You can see here that the source field is using way more memory than all the others.

GET /_stats/fielddata?fields=source (This will give all fielddata memory_size_in_bytes for a specific field on all indices)

For me, on the latest Graylog index, the source field is using 108725448 bytes (roughly 100 MB), while all other fields are using 0 or around 4000 bytes.

If I look at one specific Elasticsearch node for all metrics about fielddata, I see that 99.2% of all memory used by fielddata is going to the source field:
GET _nodes/NODEID/stats/indices/fielddata?fields=source
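
As a side note: if you ever need to release that fielddata memory without restarting the node, I believe there is a cache-clear call for it (same example index as above):

POST graylog_generic_18/_cache/clear?fielddata=true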

Again, it’s not a problem at the moment. Not for me, and probably not for other Graylog users either. But I just wanted to know why Graylog is doing it this way. Right now Elasticsearch is (unnecessarily?) using memory for this field where it doesn’t have to, I think. But I don’t know for sure, because I don’t know yet why Graylog is using it.


Hi Tim - Is it possible for you to override the source field (HTTP PUT) to use doc_values and see if the problem goes away?
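
Something along these lines, maybe (an untested sketch in ES 7.x syntax; on 6.x the mapping type has to be included, and a new sub-field would only apply to documents indexed after the change):

PUT graylog_generic_18/_mapping
{
  "properties": {
    "source": {
      "type": "text",
      "analyzer": "analyzer_keyword",
      "fields": {
        "keyword": { "type": "keyword" }
      }
    }
  }
}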

I’m not in sync with the environment you are running. The problem space is vast, and monitoring ES takes time. Possibly someone else in the community will be able to pitch in.

  • How many inputs have you configured?
  • What values does your source field point to?
  • What does your index set look like, and what is your index rotation strategy?
  • How many clusters are you running? Etc.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.