Elasticsearch Data (Un-)Balance

All,

We are performing tests with a wide variety of loads, and I have set the index to rotate hourly. Now some of the ES data nodes have run out of space and Graylog has stopped working because of unassigned shards. It seems that ES balances shards by count rather than by disk usage.
Questions:
How can I fix this so Graylog resumes working?
How can I avoid the unbalanced disk usage in the future?
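
For reference, the per-node figures in the table below can be pulled with the _cat/allocation API; a minimal sketch, assuming Elasticsearch is reachable on localhost:9200:

# show shard count and disk usage per data node
curl -s -XGET 'localhost:9200/_cat/allocation?v'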

shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
112 659.5gb 1.9tb 2.3gb 1.9tb 99 10.233.110.35 10.233.110.35 graylog3-elasticsearch-data-8
113 843.1gb 946.5gb 1tb 1.9tb 46 10.233.110.34 10.233.110.34 graylog3-elasticsearch-data-2
111 659.4gb 1.9tb 4gb 1.9tb 99 10.233.113.32 10.233.113.32 graylog3-elasticsearch-data-3
115 838.3gb 942.1gb 1tb 1.9tb 46 10.233.64.12 10.233.64.12 graylog3-elasticsearch-data-6
114 1tb 1.1tb 797.3gb 1.9tb 60 10.233.113.33 10.233.113.33 graylog3-elasticsearch-data-0
113 656.9gb 1.9tb 0b 1.9tb 100 10.233.109.33 10.233.109.33 graylog3-elasticsearch-data-7
114 1tb 1.1tb 806gb 1.9tb 59 10.233.87.3 10.233.87.3 graylog3-elasticsearch-data-11
114 848gb 951.2gb 1tb 1.9tb 47 10.233.110.33 10.233.110.33 graylog3-elasticsearch-data-5
113 1tb 1.1tb 806gb 1.9tb 59 10.233.75.20 10.233.75.20 graylog3-elasticsearch-data-10
111 1tb 1.1tb 811gb 1.9tb 59 10.233.109.32 10.233.109.32 graylog3-elasticsearch-data-1
112 660.7gb 1.9tb 2.2gb 1.9tb 99 10.233.64.13 10.233.64.13 graylog3-elasticsearch-data-9
114 839gb 942.4gb 1tb 1.9tb 46 10.233.109.31 10.233.109.31 graylog3-elasticsearch-data-4
8 UNASSIGNED

graylog_341 1 p UNASSIGNED ALLOCATION_FAILED
graylog_341 3 p UNASSIGNED ALLOCATION_FAILED
graylog_341 2 p UNASSIGNED ALLOCATION_FAILED
graylog_341 0 p UNASSIGNED ALLOCATION_FAILED
graylog_338 1 p UNASSIGNED ALLOCATION_FAILED
graylog_338 3 p UNASSIGNED ALLOCATION_FAILED
graylog_338 2 p UNASSIGNED ALLOCATION_FAILED
graylog_338 0 p UNASSIGNED ALLOCATION_FAILED
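
The unassigned shards above can be listed, together with the reason they are unassigned, via the _cat/shards API; a sketch using the same localhost endpoint:

# list unassigned shards and the reason for each
curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED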

More observations:

The large indices are not the problem; they are allocated on nodes without issues. The nodes that are full show high disk.used, but their disk.indices values are low. What could be the cause, and how can we reclaim the disk space that is not used by indices?
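
To see what is actually occupying the volume on a full node, it can help to look from inside the affected pod; a rough sketch, where the pod name and the default Elasticsearch data path are both assumptions:

# how full is the data volume, and what is on it besides index data?
kubectl exec graylog3-elasticsearch-data-8 -- df -h /usr/share/elasticsearch/data
kubectl exec graylog3-elasticsearch-data-8 -- du -sh /usr/share/elasticsearch/data/*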

Tiger

Running curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' explained the reason for the failed allocations and suggested running
curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'
which fixed the unassigned shards. However, Graylog was still not putting any messages out.
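
A quick way to check whether the on-disk journal is the culprit is to grep the Graylog server log for journal errors; a sketch, where the pod name is only a placeholder for the actual Graylog pod:

# look for journal-related errors in the Graylog server log
kubectl logs graylog3-graylog-0 | grep -i journal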

Journal file corruption caused the issue.
Deleting the journal files and restarting the Graylog nodes got the cluster working again.
As I am using Helm/Kubernetes, I changed graylog.journal.deleteBeforeStart to true ("delete all journal files before start") and redeployed the Graylog pods.
(After things were back to normal, I reset graylog.journal.deleteBeforeStart to false and redeployed one more time.)
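
For completeness, a sketch of how that toggle could be applied with Helm; the release name and chart reference are placeholders, not taken from the original setup:

# delete the corrupted journal on the next start, then redeploy
helm upgrade graylog3 <graylog-chart> --set graylog.journal.deleteBeforeStart=true
# once messages are flowing again, revert the flag and redeploy one more time
helm upgrade graylog3 <graylog-chart> --set graylog.journal.deleteBeforeStart=false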

It might be that you ran into the limits described in this blog:
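
If those limits are the disk-based allocation watermarks (85% low / 90% high / 95% flood-stage by default), they can be inspected and, if needed, raised temporarily through the cluster settings API; a sketch, assuming Elasticsearch 6.x or later on localhost:9200:

# show the current watermark settings (including defaults)
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep watermark

# raise them transiently so shards can be allocated again while space is freed up
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}'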
