Elasticsearch node won't rejoin cluster

lecko · August 28, 2018, 7:39am

Hi,

I am running production with 3 Graylog nodes and 5 Elasticsearch nodes, all on Linux bare-metal servers, with the latest Graylog version and
Elasticsearch version 5.6.6. Everything ran fine for many months, then I wanted to test how the cluster behaves without one node.
So I stopped node sn5 and also rebooted it. All the data moved to the other 4 nodes without problems, and the cluster status is OK.
But when I started Elasticsearch on node sn5, it could not rejoin the cluster. I got these errors in the log:

[2018-08-22T10:03:29,956][INFO ][o.e.b.BootstrapChecks ] [sn5] bound or publishing to a non-loopback address, enforcing bootstrap checks
[2018-08-22T10:03:59,984][WARN ][o.e.n.Node ] [sn5] timed out while waiting for initial discovery state - timeout: 30s
[2018-08-22T10:03:59,997][INFO ][o.e.h.n.Netty4HttpServerTransport] [sn5] publish_address {x.x.x.208:9200}, bound_addresses {x.x.x.208:9200}
[2018-08-22T10:03:59,997][INFO ][o.e.n.Node ] [sn5] started
[2018-08-22T10:04:00,324][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [sn5] no known master node, scheduling a retry
[2018-08-22T10:04:01,944][DEBUG][o.e.a.a.i.g.TransportGetIndexAction] [sn5] no known master node, scheduling a retry
[2018-08-22T10:04:03,320][DEBUG][o.e.a.a.i.c.TransportCreateIndexAction] [sn5] no known master node, scheduling a retry

Also this status is returned:
curl -XGET 'http://x.x.x.208:9200/_cat/health?v&pretty'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "master_not_discovered_exception",
        "reason" : null
      }
    ],
    "type" : "master_not_discovered_exception",
    "reason" : null
  },
  "status" : 503
}
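
A response like the one above can also be detected programmatically, for example from a monitoring script. The following is a minimal sketch, not part of the original post: the helper names `has_no_master` and `check_node` are hypothetical, and the sketch assumes the node answers with HTTP 503 and the JSON error body shown above.

```python
import json
import urllib.error
import urllib.request


def has_no_master(status_code, body_text):
    """Return True if the response looks like the master_not_discovered error."""
    if status_code != 503:
        return False
    try:
        body = json.loads(body_text)
    except ValueError:
        return False  # not JSON, so not the error body we are looking for
    if not isinstance(body, dict):
        return False
    error = body.get("error")
    if not isinstance(error, dict):
        return False
    return error.get("type") == "master_not_discovered_exception"


def check_node(base_url, timeout=5):
    """Query _cat/health on one node (hypothetical helper, needs network access)."""
    try:
        with urllib.request.urlopen(base_url + "/_cat/health", timeout=timeout) as resp:
            return has_no_master(resp.status, resp.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 5xx responses; the body is still readable.
        return has_no_master(err.code, err.read().decode("utf-8"))
```

Calling `check_node("http://x.x.x.208:9200")` against the node from this thread would be expected to return `True` while the node has no master.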

…
I found many similar cases on the Elastic forum. Most suggestions are to check with telnet that ports 9200 and 9300 are reachable in both directions, but telnet works just fine in my environment.
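
The telnet check can be scripted so that it is easy to repeat from every node. This is a minimal sketch under stated assumptions: the function names `port_open` and `check_ports` are hypothetical, and the script has to be run on each node in turn to cover both directions. A successful connect only shows that the TCP handshake works; it says nothing about whether larger payloads get through.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(hosts, ports=(9200, 9300), timeout=3.0):
    """Check every host/port pair and return a dict of (host, port) -> bool."""
    return {(h, p): port_open(h, p, timeout) for h in hosts for p in ports}
```

For example, `check_ports(["x.x.x.204", "x.x.x.205"])` run on sn5 would test the HTTP and transport ports of two of the other nodes.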

The elasticsearch.yml configuration is the same as on the other nodes, nothing special:

cluster.name: prod
bootstrap.memory_lock: true
discovery.zen.ping.unicast.hosts: ["x.x.x.204:9300", "x.x.x.205:9300", "x.x.x.206:9300", "x.x.x.207:9300", "x.x.x.208:9300"]
discovery.zen.minimum_master_nodes: 3
network.host: x.x.x.208
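
The value 3 for discovery.zen.minimum_master_nodes matches the usual quorum rule for Zen discovery, (master-eligible nodes / 2) + 1, provided all 5 nodes are master-eligible. That is an assumption; the thread does not say which nodes are master-eligible. A small sketch of the arithmetic, with the hypothetical function name `quorum`:

```python
def quorum(master_eligible_nodes):
    """Quorum for Zen discovery: (master-eligible nodes / 2) + 1, rounded down."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


# Assumption: all 5 Elasticsearch nodes in this setup are master-eligible.
print(quorum(5))
```

With 4 nodes still running, a quorum of 3 is still met, which is consistent with the cluster staying healthy while sn5 was down.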

Another test that I did was to stop master node sn1, so that the master role moved to another server, sn4.
But the “problematic” node sn5 also didn't want to join master sn4.
The previous master sn1, however, rejoined the cluster just fine. I don't want to reboot sn1 as well right now, because the reboot is the one thing I did differently with node sn5. First I need to solve the sn5 issue.

I did 2 other tests.

I completely removed Elasticsearch from node sn5 and reinstalled it.
The problem remained the same.
I freshly installed a new Elasticsearch node, N1, on the same network as the other nodes. I tried to join the “problematic” node sn5 into a new cluster with N1. It worked at once without problems, and the two formed a new cluster. But N1 is not connected to Graylog; that is the difference.

So it seems that something in the Elasticsearch configuration is preventing node sn5 from joining. I tried some Elasticsearch commands, such as the cluster status, and there is no mention of node sn5.
Any other idea is welcome.

Thanks

jan · August 28, 2018, 7:55am

I have seen this only when there is a typo in cluster.name and the nodes are not in the same cluster … What does the Elasticsearch log file on the other hosts show when the problem node tries to join?

lecko · August 28, 2018, 8:22am

The Elasticsearch log file on the master sn4 shows the following when this node tries to join. Again, a timeout is mentioned.

[2018-08-28T10:06:26,520][WARN ][o.e.d.z.ZenDiscovery     ] [Xu4ewOU] failed to validate incoming join request from node [{sn5}{QN7BUEBwRJKSEemLYfClXA}{cy3HKazDQZGlAqBgqmM5gA}{x.x.x.208}{x.x.x.208:9300}]
org.elasticsearch.ElasticsearchTimeoutException: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:63) ~[elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:33) ~[elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.discovery.zen.MembershipAction.sendValidateJoinRequestBlocking(MembershipAction.java:104) ~[elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.discovery.zen.ZenDiscovery.handleJoinRequest(ZenDiscovery.java:857) [elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.discovery.zen.ZenDiscovery$MembershipListener.onJoin(ZenDiscovery.java:1038) [elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.discovery.zen.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:136) [elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.discovery.zen.MembershipAction$JoinRequestRequestHandler.messageReceived(MembershipAction.java:132) [elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33) [elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69) [elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1556) [elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.6.6.jar:5.6.6]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
        at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:232) ~[elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:67) ~[elasticsearch-5.6.6.jar:5.6.6]
        at org.elasticsearch.transport.PlainTransportFuture.txGet(PlainTransportFuture.java:61) ~[elasticsearch-5.6.6.jar:5.6.6]

I must add that if I leave Elasticsearch running on sn5 while it is trying to connect to the master of the cluster, I get a major problem across the whole Graylog/Elasticsearch environment. Messages get stuck in the Graylog queues and are not processed. Only after I stop the Elasticsearch process does Graylog return to normal, and quite quickly.

Thanks for the idea about a typo. I followed it and changed the cluster name on node sn5. It did detect that the cluster names are different, and then it started its own one-node Elasticsearch cluster. But this is not the solution I am looking for…

jan · August 28, 2018, 8:47am

Please format your post so that it is more readable: FAQ - Graylog Community

Thanks for the idea about a typo. I followed it and changed the cluster name on node sn5. It did detect that the cluster names are different, and then it started its own one-node Elasticsearch cluster. But this is not the solution I am looking for…

Did fixing the typo fix your problem or not?

lecko · August 28, 2018, 9:54am

No, fixing the typo didn't fix my problem, because my cluster name was already correct.
After your idea about a typo, I intentionally introduced a typo in the cluster name just to see what happens. The errors were different and mentioned that the cluster names don't match. After them, node sn5 started and formed its own standalone cluster. This is not the solution I am looking for, as I would like it to join the existing 4-node Elasticsearch cluster.

jan · August 28, 2018, 10:06am

No

That wasn’t clear from your writing.

According to the error you have, and looking into the Elasticsearch code, something is wrong with the communication or with the node.

github.com

elastic/elasticsearch/blob/db6e8c736d92582fe56024993450ce0a987b498d/server/src/main/java/org/elasticsearch/discovery/zen/ZenDiscovery.java#L858-L883


      
          void handleJoinRequest(final DiscoveryNode node, final ClusterState state, final MembershipAction.JoinCallback callback) {
              if (nodeJoinController == null) {
                  throw new IllegalStateException("discovery module is not yet started");
              } else {
                  // we do this in a couple of places including the cluster update thread. This one here is really just best effort
                  // to ensure we fail as fast as possible.
                  onJoinValidators.stream().forEach(a -> a.accept(node, state));
                  if (state.getBlocks().hasGlobalBlock(STATE_NOT_RECOVERED_BLOCK) == false) {
                      MembershipAction.ensureMajorVersionBarrier(node.getVersion(), state.getNodes().getMinNodeVersion());
                  }
                  // try and connect to the node, if it fails, we can raise an exception back to the client...
                  transportService.connectToNode(node);
          
                  // validate the join request, will throw a failure if it fails, which will get back to the
                  // node calling the join request
                  try {
                      membership.sendValidateJoinRequestBlocking(node, state, joinTimeout);
                  } catch (Exception e) {
                      logger.warn(() -> new ParameterizedMessage("failed to validate incoming join request from node [{}]", node),
                          e);

This file has been truncated.

As we are the Graylog community and not the Elasticsearch community, you might want to ask in that community what the issue might be.

lecko · August 28, 2018, 10:20am

I already submitted a question there 3 weeks ago, but got no reply. So the Graylog community is more responsive. Thanks.

jan · August 28, 2018, 12:19pm

You might need to sherlock out what the difference is between this node and the other nodes.

Finding differences in configuration and installed packages, and checking SELinux and the firewall, would be my first actions.

… and thank you for the flowers.
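
The configuration comparison suggested above could be sketched roughly as follows. This is only an illustration: the function names `parse_flat_settings` and `diff_settings` are hypothetical, and the parser assumes flat `key: value` lines like those in the elasticsearch.yml excerpt earlier in the thread, not nested YAML. `network.host` is skipped by default because it is expected to differ per node.

```python
def parse_flat_settings(text):
    """Parse flat 'key: value' lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values such as host:port stay intact.
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(a, b, ignore=("network.host",)):
    """Return {key: (value_in_a, value_in_b)} for every key that differs."""
    diffs = {}
    for key in sorted(set(a) | set(b)):
        if key in ignore:
            continue
        if a.get(key) != b.get(key):
            diffs[key] = (a.get(key), b.get(key))
    return diffs
```

Reading the file from a working node and from sn5 and passing both parsed results to `diff_settings` would list any setting that is missing or different on one side.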

system · September 11, 2018, 12:19pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
