I have a 3-node cluster, with all 3 nodes running both Graylog and Elasticsearch. The version is 2.2.1.
Traffic is about 10,000 msg/sec, and when all 3 nodes are up it runs smoothly. The indices are configured
with replicas = 0, because the data takes a lot of disk space.
But if I need to do maintenance, or for some other reason need to take one of the 3 nodes out of the cluster,
the remaining two still process messages, but more slowly, at about 5,000 msg/sec. Both of them accumulate hundreds of thousands of messages in the journal queue, and their Graylog logs contain errors like this:
2017-04-05T11:13:13.573+02:00 WARN [BlockingBatchedESOutput] Error while waiting for healthy Elasticsearch cluster. Not flushing.
java.util.concurrent.TimeoutException: Write-active index didn’t get healthy within timeout
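One way to see why the write-active index is not healthy is to ask Elasticsearch for its cluster health while the node is down. With replicas = 0, the primary shards that lived on the missing node have no other copy, so they stay unassigned and the status goes red. The following is only a rough sketch: the address in ES_URL and the helper names are assumptions for illustration, while GET /_cluster/health is the standard Elasticsearch health endpoint.

```python
import json
import urllib.request

# Assumption: address of one of the surviving Elasticsearch nodes.
ES_URL = "http://localhost:9200"


def fetch_cluster_health(base_url=ES_URL, timeout=10):
    """Call GET /_cluster/health and return the parsed JSON body."""
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def summarize_health(health):
    """Turn a cluster health response into a one-line summary (hypothetical helper)."""
    return "status={} nodes={} unassigned_shards={}".format(
        health["status"], health["number_of_nodes"], health["unassigned_shards"]
    )
```

Running print(summarize_health(fetch_cluster_health())) on a surviving node should show a red status and a non-zero unassigned shard count if missing primaries are the cause.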
I tried to reproduce this problem on a similar test environment with much less traffic, and I got the same error
there. Then I changed the index replica count from 0 to 1. After that the test cluster stayed more stable and the error did not occur.
So it seems that the way to make the Elasticsearch cluster more stable is to set index replicas to 1.
With replicas, the Elasticsearch cluster on the remaining 2 nodes can recover from the replica copies and become healthy again.
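For reference, the replica change on already existing indices can be made through the Elasticsearch index settings API. The sketch below only builds the request and does not send it. The address in ES_URL, the index pattern graylog_* and the function name are assumptions for illustration; PUT /<index>/_settings with index.number_of_replicas is the standard Elasticsearch call. As far as I understand, the replica count in Graylog's own index configuration still has to be changed as well so that newly created indices get a replica, and one replica roughly doubles the disk space the indices need.

```python
import json
import urllib.request

# Assumption: address of one Elasticsearch node in the cluster.
ES_URL = "http://localhost:9200"


def build_replica_request(index_pattern, replicas, base_url=ES_URL):
    """Build (but do not send) a PUT /<index>/_settings request."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url="{}/{}/_settings".format(base_url, index_pattern),
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Assumption: the Graylog indices use the default "graylog" prefix.
request = build_replica_request("graylog_*", 1)
# To apply the change: urllib.request.urlopen(request, timeout=10)
```

The request is kept separate from sending it so that the URL and body can be checked before anything is changed on a production cluster.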
Is there any other way to improve Elasticsearch health so that it keeps running well when one of the nodes goes down?