Hi,
I am running production with 3 Graylog nodes and 5 Elasticsearch nodes, all on Linux bare metal servers, Graylog on the latest version and Elasticsearch on version 5.6.6. Everything was running fine for many months, then I wanted to test how the cluster behaves without one node.
So I stopped node sn5 and also rebooted it. All the data was relocated to the other 4 nodes and the cluster status was OK.
But when I started Elasticsearch on node sn5 again, it just cannot rejoin the cluster. I get these errors in the log:
[2018-08-22T10:03:29,956][INFO ][o.e.b.BootstrapChecks ] [sn5] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-08-22T10:03:59,984][WARN ][o.e.n.Node ] [sn5] timed out while waiting for initial discovery state - timeout: 30s
[2018-08-22T10:03:59,997][INFO ][o.e.h.n.Netty4HttpServerTransport] [sn5] publish_address {x.x.x.208:9200}, bound_addresses {x.x.x.208:9200}
[2018-08-22T10:03:59,997][INFO ][o.e.n.Node ] [sn5] started
[2018-08-22T10:04:00,324][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [sn5] no known master node, scheduling a retry
[2018-08-22T10:04:01,944][DEBUG][o.e.a.a.i.g.TransportGetIndexAction] [sn5] no known master node, scheduling a retry
[2018-08-22T10:04:03,320][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [sn5] no known master node, scheduling a retry
Also this status is returned:
curl -XGET 'http://x.x.x.208:9200/_cat/health?v&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}
…
I found many similar cases on the Elastic forum. Most suggestions are to check telnet connectivity to ports 9200 and 9300 in both directions, but telnet works just fine in my environment.
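For reference, the connectivity check I ran was roughly like this (x.x.x.204 stands for any one of the other nodes; the same was tested in the opposite direction, from that node back to x.x.x.208):

# from sn5 towards another node
telnet x.x.x.204 9300
telnet x.x.x.204 9200
# or, equivalently, with netcat
nc -vz x.x.x.204 9300
nc -vz x.x.x.204 9200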
The elasticsearch.yml config is just the same as on the other nodes, nothing special:
cluster.name: prod
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["x.x.x.204:9300", "x.x.x.205:9300", "x.x.x.206:9300", "x.x.x.207:9300", "x.x.x.208:9300"]
discovery.zen.minimum_master_nodes: 3
network.host: x.x.x.208
Another test that I did was to stop the master node sn1, so that the master role moved to another server, sn4.
But the "problematic" node sn5 also didn't want to join master sn4.
However, the previous master sn1 rejoined the cluster just fine. I don't want to also reboot sn1 right now, because the reboot is the one thing I did differently with node sn5. First I need to solve the sn5 issue.
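For reference, the current master can be checked with something like this, run against any reachable node:

curl -XGET 'http://x.x.x.204:9200/_cat/master?v'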
I did 2 other tests:
- I completely removed Elasticsearch from node sn5 and reinstalled it. The problem remained the same.
- I freshly installed a new Elasticsearch node N1 on the same network as the other nodes, and tried to join the "problematic" node sn5 into a new cluster with N1. It worked at once without problems and they both formed a new cluster. But N1 is not connected with Graylog, so that is a different setup.
So it seems that something in the Elasticsearch configuration is preventing the sn5 node from joining. I tried some Elasticsearch commands like cluster status, and there is no mention of the sn5 node.
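These are roughly the kind of commands I tried against one of the healthy nodes (sn5 never appears in the node list):

curl -XGET 'http://x.x.x.204:9200/_cluster/health?pretty'
curl -XGET 'http://x.x.x.204:9200/_cat/nodes?v'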
Any other ideas are welcome.
Thanks