Runaway Index and allocation failure

Well, this is weird. GL has been running in our organisation for 18 months (kept up to date), and we've never had any problems of note.

Today, however, I came into the office after the weekend and my cluster state was red. I had a quick look at my indexes and noticed that the current active index had not cycled. My max index size is 35GB, and this one had managed to rock up to 217GB.

Is there anywhere in particular I should be looking for the root cause of this? My ES cluster was at 90% capacity at the time (so around 400GB left on each of the three ES nodes).

I tried to manually rotate the active write index, but that just seemed to create more unassigned shards. I then deleted around 5 indexes from the bottom of the pile, and the unassigned shards started re-assigning. Weird, as it looked to me like there was plenty of space left on the ES cluster. But more importantly, what would cause it not to rotate in the first place? Is there some sort of limit imposed on the amount of ES storage that can be used, possibly?

Images attached.

Many Thanks,
Tom


The health state of your Elasticsearch cluster is RED (also see http://docs.graylog.org/en/2.4/pages/configuration/elasticsearch.html#cluster-status-explained).

Check the logs of your Elasticsearch node(s) and make sure that the cluster state is YELLOW or GREEN (recommended). After that, Graylog should be able to rotate the indices again.
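
A quick way to check this is against the Elasticsearch REST API. A minimal sketch in Python (the endpoint address is an assumption - point it at any of your ES nodes):

```python
import requests

# Assumed endpoint; replace with the address of one of your ES nodes.
ES = "http://localhost:9200"

# Cluster health: Graylog needs this to be "green" (or at least "yellow")
# before it can rotate the write index again.
health = requests.get(f"{ES}/_cluster/health").json()
print(health["status"], "- unassigned shards:", health["unassigned_shards"])

# Per-node disk usage and shard counts, to spot a node that hit a watermark.
print(requests.get(f"{ES}/_cat/allocation", params={"v": "true"}).text)
```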


Is your retention strategy time-based? If yes - something that sends logs to Graylog may have been set to debug level, or some host had serious problems and yelled about them into the logs.

ES tends to start complaining when the filesystem holding your ES data exceeds the low/high disk watermark settings.

By default, once disk usage on a node passes the low watermark (85%), ES stops allocating new shards to that node; past the high watermark (90%), it starts relocating shards away. You can still read data from that node, but no new shards will be created there. These thresholds can be raised in your Elasticsearch configuration.

I am guessing that this is your issue - can’t know for sure without seeing log messages.

https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html
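
To see which watermarks your cluster is actually running with, you can ask ES directly. A minimal sketch in Python (the endpoint is an assumption, and `include_defaults` needs a reasonably recent ES version):

```python
import requests

# Assumed endpoint; replace with one of your ES nodes.
ES = "http://localhost:9200"

# Effective disk-watermark settings, including built-in defaults.
settings = requests.get(
    f"{ES}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
).json()
for key, value in settings.get("defaults", {}).items():
    if "watermark" in key:
        print(key, "=", value)
```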

Ways to get out of the issue:

  1. Add another node to your cluster so there is more space available. Not a perfectly simple operation, but depending on how your indexes are configured, ES can automatically move things around.
  2. Adjust the watermark settings so your ES nodes can use a little more space (instead of 90%, maybe 93%) - see the sketch after this list.
  3. Delete old indexes to get back under the mark (also sketched below).
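
For options 2 and 3, a rough sketch against the ES REST API (the endpoint and index name are made up - adjust to your setup):

```python
import requests

# Assumed endpoint; replace with one of your ES nodes.
ES = "http://localhost:9200"

# Option 2: raise the low watermark so allocation resumes with less free space.
# "transient" settings are lost on a full cluster restart; use "persistent"
# if you want the change to stick.
resp = requests.put(
    f"{ES}/_cluster/settings",
    json={"transient": {"cluster.routing.allocation.disk.watermark.low": "93%"}},
)
print(resp.json())

# Option 3: delete an old index to get back under the mark. "graylog_42" is
# a made-up name - list your indices first (GET /_cat/indices) and pick the
# oldest. Note that Graylog normally manages retention itself.
requests.delete(f"{ES}/graylog_42")
```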

Anyway - logs from your ES node(s) will provide more information about the issue.

Dustin


Yep, bang on. Thank you, Dustin.
My low watermark was 90%, and I'd recently increased my retention policy slightly.

That was enough to push it just over.

Now corrected and all clear.

Appreciate your help.

Many Thanks,
Tom

Glad to hear it worked out!!

Dustin
