Cluster stability depends on index replicas?

lecko · April 13, 2017, 1:25pm

Hello,

I have a 3-node cluster, with all 3 nodes running both Graylog and Elasticsearch. The version is 2.2.1.

Traffic is about 10,000 msg/sec, and when all 3 nodes are running it runs smoothly. The indices are configured
with replicas = 0, because the data takes a lot of disk space.

But if I need to do some maintenance, or for some other reason need to take one of the 3 nodes out of the cluster, then
the remaining two still process messages, but more slowly, at about 5,000 msg/sec. They both accumulate hundreds of thousands of messages in the journal queue, and their Graylog logs contain errors:

2017-04-05T11:13:13.573+02:00 WARN [BlockingBatchedESOutput] Error while waiting for healthy Elasticsearch cluster. Not flushing.
java.util.concurrent.TimeoutException: Write-active index didn’t get healthy within timeout

I tried to simulate this problem in a similar test environment with much less traffic, and I could reproduce the above error
there. But then I changed the index replicas from 0 to 1. After that the test cluster remained more stable and the error didn't happen.

So it seems that the way to make the Elasticsearch cluster more stable is to set index replicas to 1.
By doing so, the Elasticsearch cluster on the remaining 2 nodes can recover from replicas and get healthy.
Is there any other solution to improve Elasticsearch health, so that it keeps running well when one of the nodes goes down?
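
For reference, the replica change described above can be applied to existing indices through the Elasticsearch update index settings API (`PUT /<index>/_settings` with `index.number_of_replicas`). The following is only a minimal sketch: the host `http://localhost:9200`, the index pattern `graylog_*`, and the helper name `build_replica_update` are assumptions for illustration, not values taken from this cluster. It builds the request without sending it. Newly created indices would presumably still take their replica count from Graylog's own index configuration, so that would need changing as well.

```python
import json
import urllib.request


def build_replica_update(base_url, index_pattern, replicas):
    """Build (but do not send) a PUT request that sets index.number_of_replicas.

    base_url and index_pattern are placeholders; adjust them to the real cluster.
    """
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index_pattern}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Assumed host and index pattern, for illustration only.
req = build_replica_update("http://localhost:9200", "graylog_*", 1)

# To apply the change against a live cluster, send the request:
#   urllib.request.urlopen(req)
```

Keep in mind that going from 0 to 1 replicas roughly doubles the disk space used by the affected indices, which is the trade-off mentioned at the start of the post.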

jochen · April 13, 2017, 1:52pm

I’m not sure why this isn’t obvious, but if you don’t have any replicas in your Elasticsearch cluster, you have no redundancy at all and all nodes must be available all the time, especially if all nodes are storing primary shards of all indices.
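
A toy model can illustrate the point. This is a sketch under simplifying assumptions, not how Elasticsearch actually allocates shards: it places copy `c` of shard `s` on node `(s + c) % num_nodes`, which only mimics the real guarantee that a primary and its replica never share a node. The function name `surviving_shards` is hypothetical.

```python
def surviving_shards(num_nodes, num_shards, replicas, failed_node):
    """Return the ids of shards that still have at least one copy after failed_node is lost."""
    alive = set()
    for shard in range(num_shards):
        # Simplified round-robin placement: primary plus `replicas` copies on consecutive nodes.
        holders = {(shard + copy) % num_nodes for copy in range(replicas + 1)}
        if holders - {failed_node}:
            alive.add(shard)
    return alive


# 3 nodes, 3 shards, node 0 taken down for maintenance.
without_replicas = surviving_shards(3, 3, replicas=0, failed_node=0)
with_one_replica = surviving_shards(3, 3, replicas=1, failed_node=0)

print(sorted(without_replicas))  # shard 0 is missing: the index is incomplete
print(sorted(with_one_replica))  # every shard still has a copy
```

With no replicas, the shard whose only copy lived on the stopped node is gone, so the write-active index cannot become healthy. With one replica, every shard keeps a copy on one of the two remaining nodes.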

lecko · April 13, 2017, 1:58pm

Thanks Jochen, yes, it is obvious, but it helps to have it confirmed by the community, and I thought maybe there was some other solution.
