Elasticsearch Data (Un-)Balance

All,

We are performing tests with a wide variety of loads, and I have set the index to rotate hourly. Now some of the ES data nodes have run out of space and Graylog has stopped working because of unassigned shards. It seems that ES balances shards by count rather than by disk usage.
Questions:
How can I fix this so Graylog resumes working?
How can I avoid the unbalanced disk usage in the future?
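
For reference, the per-node figures in the table below can be pulled with the _cat/allocation API; a minimal sketch, assuming Elasticsearch is reachable on localhost:9200:

# show shard count and disk usage per data node
curl -s -XGET 'localhost:9200/_cat/allocation?v'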

shards disk.indices disk.used disk.avail disk.total disk.percent host ip node
112 659.5gb 1.9tb 2.3gb 1.9tb 99 10.233.110.35 10.233.110.35 graylog3-elasticsearch-data-8
113 843.1gb 946.5gb 1tb 1.9tb 46 10.233.110.34 10.233.110.34 graylog3-elasticsearch-data-2
111 659.4gb 1.9tb 4gb 1.9tb 99 10.233.113.32 10.233.113.32 graylog3-elasticsearch-data-3
115 838.3gb 942.1gb 1tb 1.9tb 46 10.233.64.12 10.233.64.12 graylog3-elasticsearch-data-6
114 1tb 1.1tb 797.3gb 1.9tb 60 10.233.113.33 10.233.113.33 graylog3-elasticsearch-data-0
113 656.9gb 1.9tb 0b 1.9tb 100 10.233.109.33 10.233.109.33 graylog3-elasticsearch-data-7
114 1tb 1.1tb 806gb 1.9tb 59 10.233.87.3 10.233.87.3 graylog3-elasticsearch-data-11
114 848gb 951.2gb 1tb 1.9tb 47 10.233.110.33 10.233.110.33 graylog3-elasticsearch-data-5
113 1tb 1.1tb 806gb 1.9tb 59 10.233.75.20 10.233.75.20 graylog3-elasticsearch-data-10
111 1tb 1.1tb 811gb 1.9tb 59 10.233.109.32 10.233.109.32 graylog3-elasticsearch-data-1
112 660.7gb 1.9tb 2.2gb 1.9tb 99 10.233.64.13 10.233.64.13 graylog3-elasticsearch-data-9
114 839gb 942.4gb 1tb 1.9tb 46 10.233.109.31 10.233.109.31 graylog3-elasticsearch-data-4
8 UNASSIGNED

graylog_341 1 p UNASSIGNED ALLOCATION_FAILED
graylog_341 3 p UNASSIGNED ALLOCATION_FAILED
graylog_341 2 p UNASSIGNED ALLOCATION_FAILED
graylog_341 0 p UNASSIGNED ALLOCATION_FAILED
graylog_338 1 p UNASSIGNED ALLOCATION_FAILED
graylog_338 3 p UNASSIGNED ALLOCATION_FAILED
graylog_338 2 p UNASSIGNED ALLOCATION_FAILED
graylog_338 0 p UNASSIGNED ALLOCATION_FAILED
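
The unassigned shards above can be listed, together with the reason they are unassigned, via the _cat/shards API; a sketch using the same localhost endpoint:

# list unassigned shards and the reason for each
curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED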

More observations:

The large indices are not the problem; they are allocated on nodes without issues. The nodes that are full show high disk.used, but their disk.indices values are low. What could be the cause, and how can we reclaim the disk space that is not used by indices?
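
To see what is actually occupying the volume on a full node, it can help to look from inside the affected pod; a rough sketch, where the pod name and the default Elasticsearch data path are both assumptions:

# how full is the data volume, and what is on it besides index data?
kubectl exec graylog3-elasticsearch-data-8 -- df -h /usr/share/elasticsearch/data
kubectl exec graylog3-elasticsearch-data-8 -- du -sh /usr/share/elasticsearch/data/*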

Tiger

Running curl -XGET 'localhost:9200/_cluster/allocation/explain?pretty' explained the reason for the failed allocations and suggested running
curl -s -XPOST 'localhost:9200/_cluster/reroute?retry_failed=true'
which fixed the unassigned shards. However, Graylog was still not putting any messages out.
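
A quick way to check whether the on-disk journal is the culprit is to grep the Graylog server log for journal errors; a sketch, where the pod name is only a placeholder for the actual Graylog pod:

# look for journal-related errors in the Graylog server log
kubectl logs graylog3-graylog-0 | grep -i journal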

Journal file corruption caused the issue.
Deleting the journal files and restarting the Graylog nodes got the cluster working again.
As I am using Helm/Kubernetes, I changed graylog.journal.deleteBeforeStart to true ("delete all journal files before start") and redeployed the Graylog pods.
(After things were back to normal, I reset graylog.journal.deleteBeforeStart to false and redeployed one more time.)
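
For completeness, a sketch of how that toggle could be applied with Helm; the release name and chart reference are placeholders, not taken from the original setup:

# delete the corrupted journal on the next start, then redeploy
helm upgrade graylog3 <graylog-chart> --set graylog.journal.deleteBeforeStart=true
# once messages are flowing again, revert the flag and redeploy one more time
helm upgrade graylog3 <graylog-chart> --set graylog.journal.deleteBeforeStart=false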

It might be that you ran into the limits described in this blog:
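
If those limits are the disk-based allocation watermarks (85% low / 90% high / 95% flood-stage by default), they can be inspected and, if needed, raised temporarily through the cluster settings API; a sketch, assuming Elasticsearch 6.x or later on localhost:9200:

# show the current watermark settings (including defaults)
curl -s 'localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty' | grep watermark

# raise them transiently so shards can be allocated again while space is freed up
curl -s -XPUT 'localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
  }
}'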
