After many months of problem-free operation, I’ve now got a problem that has blocked basically all use of Graylog for search.
Now, any time I try to do a “Quick Value” sort on some data, I get the dreaded red popup at the bottom and the ES cluster starts reporting
Caused by: java.lang.IllegalArgumentException: Fielddata is disabled on text fields by default. Set fielddata=true on [application_name] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
(that is an example of me trying to sort some data by the “application_name” field)
I’ve now altered the “/_template/graylog-custom-mapping” template to add “fielddata”: true to that field (and a couple of others causing the same error), confirmed with curl that the change took effect, and rotated the index. But even though it’s now been four hours since doing this and the system has itself rotated to a new index, it still can’t sort on “application_name”, even over just the last five minutes. That doesn’t seem right?
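For reference, the template change I made was roughly along these lines (simplified - the field name is from my setup, and the host and template pattern are just what I’m using):

    curl -X PUT 'http://localhost:9200/_template/graylog-custom-mapping' \
      -H 'Content-Type: application/json' -d '
    {
      "template": "graylog_*",
      "mappings": {
        "message": {
          "properties": {
            "application_name": { "type": "text", "fielddata": true }
          }
        }
      }
    }'

As I understand it, a template change like this only affects indices created after the change, which is why I rotated the index afterwards.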
If I look at “/system/index_sets”, I also notice that the older indices no longer look right time-wise. I’m pushing syslog and GELF data into Graylog, so I’d expect the newest of the older indices to show something like “Contains messages from 2 hours ago up to in 5 hours”, but instead the first 20+ indices all say “Contains messages from 2 months ago up to in 6 months”, i.e. the timestamps seem completely wrong.
Actually, I just ran a standard search over the past 5 minutes and noticed the summary “Search result Found 975,472 messages in 761 ms, searched in 655 indices.” 655 indices? Surely that should be 1, or maybe 2, indices?
Any ideas what’s gone wrong? These are CentOS-7 systems running graylog-server-3.0.1-2.noarch and elasticsearch-5.6.16-1.noarch (4-node cluster) from official repos.
There must be something wrong with these indices: check out this one, created yesterday (shown on the Graylog “/system/index_sets/xxxxx” page):
graylog_10154 Contains messages from 49 years ago up to in 6 months
The inputs are syslog and GELF messages, so I’d expect it to cover a few hours from yesterday. If we had some corrupt GELF data coming in with bad timestamps, I can imagine those showing up as “0”, since that’s 49 years ago (i.e. Jan 1970). But the “up to in 6 months” is plain IMPOSSIBLE. There would definitely be legitimate data with yesterday’s timestamp in there - even if only one record - so how can that index claim not to contain anything newer than 6 months old???
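Presumably, if bad timestamps are the cause, I could confirm it by asking ES directly for the newest documents in that index? Something roughly like this (the index name is from above, the host is a placeholder):

    # Show the three newest messages in the suspect index, by stored timestamp
    curl -s 'http://localhost:9200/graylog_10154/_search?size=3' \
      -H 'Content-Type: application/json' -d '
    {
      "sort": [ { "timestamp": { "order": "desc" } } ],
      "_source": [ "timestamp", "source" ]
    }'

If messages with timestamps months in the future came back, that would at least explain where the “up to in 6 months” upper bound is coming from.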
OK… Well, the systems are already at 64G RAM each with ES_JAVA_OPTS="-Xms31g -Xmx31g", so am I correct in saying my options are either to add more cluster nodes (so that the average number of shards per node is reduced), or to reduce the maximum number of indices (i.e. data) I’m willing to keep?
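Before adding nodes I guess I can at least check how much heap and fielddata are actually in use; the standard cat APIs should show that (host is a placeholder, and the column list is just an example):

    # Heap usage and fielddata memory per node
    curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,fielddata.memory_size'

    # Fielddata memory broken down per field (relevant now that fielddata is enabled on text fields)
    curl -s 'http://localhost:9200/_cat/fielddata?v'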
What I don’t understand is: if this is a RAM-starvation issue, why does it have anything to do with this weird index behaviour? I.e. why is Graylog searching through 600 indices of ~40G each (in my case) for records that were added in the past 5 minutes? All of that data should be in the current index, or maybe the one before it. By searching through basically every index for every query, it’s no wonder it’s struggling for resources. What I don’t get is that it didn’t behave like this before…
Graylog will search all indices that can hold data for the time period you are searching in.
It performs a “min/max” query on each index and records the timestamp range that the index holds. When a 5-minute search hits hundreds of indices, that indicates you have messages with wrong timestamps in them, which makes Graylog think those indices might contain data it needs to search.
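You can run the same kind of min/max aggregation yourself against one of the suspect indices to see what ES thinks its timestamp range is - roughly (index name taken from the post above, host is a placeholder):

    # Min and max stored timestamp for one index
    curl -s 'http://localhost:9200/graylog_10154/_search?size=0' \
      -H 'Content-Type: application/json' -d '
    {
      "aggs": {
        "min_ts": { "min": { "field": "timestamp" } },
        "max_ts": { "max": { "field": "timestamp" } }
      }
    }'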
Yes, I believe that’s true, but what I don’t get is that within the Graylog “System->Indices” area, which shows each index, the details dropdown for the newest non-active index shows “Contains messages from 2 months ago up to in 6 months” instead of “Contains messages from 1 hour ago up to in 6 months”. I know for a fact that data with a current timestamp is arriving successfully (a simple search shows as much), so how can the newest indices claim they only contain old data? Even a single correctly-timestamped record should stop that being the case.
Not a single index claims to contain data from the past few weeks, and yet Graylog search still works - it just reads a tonne more indices than it should.
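Even though the recorded ranges look wrong, they can at least be inspected (and recalculated) through the Graylog REST API; as far as I can tell from the 3.0 API browser, the relevant calls are roughly these (host and credentials are placeholders):

    # List the timestamp range Graylog has recorded for each index
    curl -s -u admin:password 'http://graylog.example.com:9000/api/system/indices/ranges?pretty'

    # Recalculate the ranges for all indices
    # (also available in the UI via the Maintenance menu on the index set page)
    curl -s -u admin:password -X POST -H 'X-Requested-By: cli' \
      'http://graylog.example.com:9000/api/system/indices/ranges/rebuild'

    # Or recalculate a single index
    curl -s -u admin:password -X POST -H 'X-Requested-By: cli' \
      'http://graylog.example.com:9000/api/system/indices/ranges/graylog_10154/rebuild'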