Difference between Elasticsearch process down vs. Elasticsearch host down

Hello,

I am using Graylog Open 3.3 with Elasticsearch 6.8. There are 3 Graylog nodes and 7 Elasticsearch nodes, and all indices use replicas.

Over the weekend one of the Elasticsearch nodes went down and we noticed the Graylog queues filling up.

I was trying to reproduce that. I ran some tests to see how Graylog behaves if I stop the Elasticsearch process on one of the Elasticsearch nodes vs. what happens if I shut the whole node down. The difference was really interesting, and I could reproduce it several times.

  1. Stopping the Elasticsearch process on one node.
    The Elasticsearch status may turn yellow or even red for some time, but Graylog keeps working: searches run and messages are processed the whole time, without getting stuck in the queue.

  2. Shutting down the same Elasticsearch node.
    Now the Elasticsearch status stays RED. Graylog stops processing messages: they get stuck in the queue and the journal queue gets longer and longer. If no action is taken, it fills up.

I found many Elasticsearch errors in the Graylog server log file, like:

 "2021-07-22T14:24:00.818+02:00 ERROR [IndexFieldTypePollerPeriodical] Couldn't update field types for index set <IMS logi/5bf42d249712e82879d4268b>
org.graylog2.indexer.ElasticsearchException: Couldn't collect indices for alias <index_name>_deflector"

I wanted to check for errors in the Elasticsearch logs, but they are empty: a few minutes after the shutdown there is still nothing in any of the Elasticsearch logs.

  3. Powering the Elasticsearch node on again, with the Elasticsearch process still down.
    As soon as the node is back, Graylog starts processing messages again. The journal queue gets smaller and smaller and, given enough time, it empties.

The Elasticsearch status is still red most of the time, but given enough time it changes to green.
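
One way to compare the two cases from the client side is to time a plain TCP connect to the node's port. This is only a minimal sketch and rests on an assumption that is not confirmed by the logs above: that a host which is up with no process listening refuses the connection immediately, while a powered-off host does not answer at all, so the client waits for its timeout. The function name `probe` and the 3 second timeout are just for illustration.

```python
import socket
import time


def probe(host, port, timeout=3.0):
    """Try a single TCP connect and return (outcome, seconds_elapsed).

    outcome is "open", "refused", "timeout" or "error: ...".
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "open"
    except ConnectionRefusedError:
        # Host is reachable but nothing listens on the port.
        outcome = "refused"
    except socket.timeout:
        # No answer within the timeout, e.g. host is not reachable.
        outcome = "timeout"
    except OSError as exc:
        # Anything else (no route, name resolution failure, ...).
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - start


# Usage (replace the placeholder with a real node address):
#   print(probe("<IP1>", 9200))
```

Running it against the same node in both states (process stopped, host powered off) would show whether the second case is the slow one.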

Has anybody seen similar behaviour?
I am looking for a solution that would prevent the journal from filling up when an Elasticsearch node goes down, so that Graylog continues to run.

Thanks in advance.

What does your ES cluster config look like? Do you have more than one master eligible node? (Ideally 3 to prevent split brain)

I think any of the 7 ES nodes can become master; I don't limit the master role to certain nodes.
The configuration has been working well for months. I just find this scenario interesting:
from the Graylog side it makes a difference whether the node is up or down.

config for elasticsearch:

cluster.name: bigger_graylog
path.data: /cached_d1,/cached_d2,/cached_d3,/cached_d4,/cached_d5,/cached_d6,/cached_d7,/cached_d8,/cached_d9,/cached_d10
path.logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: <IP1>
discovery.zen.ping.unicast.hosts: ["<IP1>:9300","<IP2>:9300","<IP3>:9300","<IP4>:9300","<IP5>:9300","<IP6>:9300","<IP7>:9300"]
discovery.zen.minimum_master_nodes: 4
path.repo: ["/bkp_mnt"]
xpack.monitoring.enabled: false
http.cors.enabled: true
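
As a quick check of the `discovery.zen.minimum_master_nodes` value above: with 7 master-eligible nodes, the usual quorum rule (master-eligible nodes / 2 + 1, rounded down) gives 4, so losing a single node should still leave enough nodes to elect a master. A minimal sketch of that arithmetic, with function names made up for illustration:

```python
def minimum_master_nodes(master_eligible: int) -> int:
    """Quorum rule for zen discovery: (master-eligible nodes / 2) + 1."""
    return master_eligible // 2 + 1


def can_elect_master(master_eligible: int, minimum: int, lost: int) -> bool:
    """True if enough master-eligible nodes remain after losing `lost` nodes."""
    return master_eligible - lost >= minimum


quorum = minimum_master_nodes(7)
print(quorum)                           # 4
print(can_elect_master(7, quorum, 1))   # True: 6 nodes remain
print(can_elect_master(7, quorum, 4))   # False: only 3 nodes remain
```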