Indexer failures on restarts

We are running Graylog in a Kubernetes cluster. The VMs automatically get OS updates, and we use kured to automatically drain the nodes before rebooting them.
When this happened last night, Graylog reported ~1000 indexing failures with the message
{"type":"unavailable_shards_exception","reason":"[graylog_10][2] primary shard is not active Timeout: [1m],

We have 2 Graylog pods; only 1 rebooted last night.
We have 3 elasticsearch-master-n pods; only 1 of those rebooted last night.

Does an indexing failure message indicate that the log has been lost? Or is it buffered and retried?
If lost, what do I need to look into to prevent this?

Looking at the app logs with kubectl has the downside that you lose the history after reboots. Graylog is much better in this regard, but losing logs during reboots reduces that value.

The message indicates that you have lost those messages.

Most likely the Elasticsearch node that holds the primary shard of the index was rebooting and was not available to write to (I guess). What are your index settings? How many shards and replicas do you have configured?
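If you want to check this from the cluster itself, both the shard layout and the replica count are visible through the Elasticsearch REST API. A minimal sketch using only the Python standard library, assuming the cluster is reachable on localhost:9200 (e.g. via kubectl port-forward) without authentication, and that the active index is graylog_10 as in the error message above:

```python
import json
from urllib.request import urlopen

ES = "http://localhost:9200"  # assumption: port-forwarded, no auth

def get(path):
    with urlopen(ES + path) as resp:
        return json.load(resp)

# Overall cluster state: "red" means at least one primary shard is unassigned.
health = get("/_cluster/health")
print("cluster status:", health["status"],
      "unassigned shards:", health["unassigned_shards"])

# Per-shard view: which node holds each primary/replica of the graylog indices.
for shard in get("/_cat/shards/graylog_*?format=json"):
    print(shard["index"], shard["shard"], shard["prirep"],
          shard["state"], shard.get("node"))

# Shard and replica settings of the index named in the error message.
settings = get("/graylog_10/_settings")
for name, cfg in settings.items():
    idx = cfg["settings"]["index"]
    print(name, "shards:", idx["number_of_shards"],
          "replicas:", idx["number_of_replicas"])
```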

This is our dev system that we are using to learn about Graylog. I didn't put much thought into the ES setup.
5 indices
4 shards
0 replicas

Is the 0 replicas the issue?
If this is an ES problem, I will get some help from the ES expert on the team.

If you have no replicas and 4 shards spread over 3 Elasticsearch servers, every reboot of one ES server will make the index unavailable.

Yes, that is an Elasticsearch problem.
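For reference, number_of_replicas is a dynamic index setting, so it can be raised on the indices that already exist without reindexing. A minimal sketch under the same assumptions as above (unauthenticated cluster on localhost:9200):

```python
import json
from urllib.request import Request, urlopen

ES = "http://localhost:9200"  # assumption: port-forwarded, no auth

# number_of_replicas is dynamic, so this applies to existing indices immediately.
# With 3 data nodes, 1 replica lets any single node reboot without making the
# index unwritable: the replica on another node is promoted to primary.
body = json.dumps({"index": {"number_of_replicas": 1}}).encode()
req = Request(
    ES + "/graylog_*/_settings",
    data=body,
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urlopen(req) as resp:
    print(json.load(resp))  # expect {"acknowledged": true}
```

Indices that Graylog creates later during rotation take their shard and replica counts from the index set configuration in the Graylog UI, so the replica setting should be raised there as well, otherwise the next rotated index will again be created with 0 replicas.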

Just to follow up, after setting up replicas, I’ve not had any more indexing failures.
Thanks @jan
