We are running Graylog in a Kubernetes cluster. The VMs automatically get OS updates, and we use kured to automatically drain the nodes before rebooting them.
When this happened last night, Graylog reported ~1000 indexing failures with the message:
{"type":"unavailable_shards_exception","reason":"[graylog_10][2] primary shard is not active Timeout: [1m],
We have 2 Graylog pods. Only 1 rebooted last night.
We have 3 elasticsearch-master-n pods. Only 1 of those rebooted last night.
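For reference, a minimal sketch of how the shard state can be inspected from inside the cluster (the `graylog` namespace and the `elasticsearch-master-0` pod name are assumptions based on our setup, and this assumes `curl` is available in the Elasticsearch container; adjust to your deployment):

```
# Cluster health: overall status and number of unassigned shards
kubectl -n graylog exec elasticsearch-master-0 -- \
  curl -s 'http://localhost:9200/_cluster/health?pretty'

# Per-shard view: which graylog_* primaries/replicas are unassigned and why
kubectl -n graylog exec elasticsearch-master-0 -- \
  curl -s 'http://localhost:9200/_cat/shards/graylog_*?v&h=index,shard,prirep,state,unassigned.reason,node'
```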
Does an indexing failure message indicate that the log has been lost? Or is it buffered and retried?
If lost, what do I need to look into to prevent this?
Looking at the app logs with kubectl has the downside that you lose the history after reboots. Graylog is much better in this regard, but losing logs during reboots reduces that value.
The message indicates that you have lost messages.
Most likely because the Elasticsearch node that holds the primary shard of the index was rebooting and not available for writes (I guess). What are your index settings? How many shards and replicas do you have configured?
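You can check this by querying the index settings directly. If the indices have 0 replicas, any single node reboot will leave some primaries unavailable; with at least 1 replica, Elasticsearch can promote a copy on another node to primary and keep accepting writes. A hedged sketch (the `elasticsearch-master` hostname is a placeholder; note that in Graylog the shard/replica counts are normally set on the index set under System -> Indices and only apply to newly created indices):

```
# Show how many shards and replicas each Graylog index currently has
curl -s 'http://elasticsearch-master:9200/graylog_*/_settings/index.number_of_*?pretty'

# Add a replica to an existing index so its shards survive a single node being down
# (new indices take the replica count from the Graylog index set configuration instead)
curl -s -X PUT 'http://elasticsearch-master:9200/graylog_10/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 1}}'
```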