I’m running a Graylog cluster with 2 nodes and recently upgraded from 3.3.8 to 4.0.1. Since the upgrade, the cluster cannot keep a stable node status.
The System/Nodes view is inconsistent: at any given moment it shows one node, both, or neither. Nodes appear and disappear from the view very quickly, even though both are running and ingesting logs normally. Reducing the cluster to a single node makes the problem go away, but I need the redundancy.
Once the second node is started, the first node's log is flooded with this message, repeating roughly every second:
2021-01-14 07:16:23,502 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:24,502 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:25,502 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:26,502 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:27,502 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:32,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:37,066 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:38,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:39,069 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:40,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:41,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:42,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:43,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:44,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:45,072 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:46,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:47,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:48,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:49,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:50,067 WARN [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
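To put a number on how often the warning repeats, here is a minimal sketch that measures the gap between consecutive re-registration warnings. It assumes the log format shown above, with the timestamp in the first 23 characters of each line; the function name `reregister_gaps` is only illustrative and not part of Graylog.

```python
from datetime import datetime

# Message text copied from the log lines above.
WARNING = "Did not find meta info of this node. Re-registering."


def reregister_gaps(log_lines):
    """Return the seconds elapsed between consecutive re-registration warnings."""
    stamps = []
    for line in log_lines:
        if WARNING not in line:
            continue
        # Assumption: the timestamp is the first 23 characters,
        # e.g. "2021-01-14 07:16:23,502" (comma before the milliseconds).
        stamps.append(datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f"))
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]
```

Feeding it the lines of the server log (for example `reregister_gaps(open(path))`, where `path` is wherever the log lives on your system) gives one gap per pair of warnings, which makes occasional longer pauses easy to spot.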
I have found discussions that attribute this to clock skew between servers. I have checked that, and all hosts are NTP-synchronized.
I can watch the MongoDB collection in real time, and the node's entry keeps being deleted and reinserted every second.
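For anyone who wants to reproduce that observation, here is a rough sketch of how the churn could be counted. It rests on several assumptions: the database is named `graylog`, the collection is `nodes`, each document has a `node_id` field, and a delete followed by a reinsert produces a new `_id`. The helper names `count_reinserts` and `poll_nodes` are made up for this example, and `pymongo` is a third-party package.

```python
import time


def count_reinserts(snapshots):
    """Count, per node_id, how often its document _id changed between polls.

    snapshots: list of dicts mapping node_id -> document _id, one per poll.
    A changed _id is taken as evidence of a delete followed by a reinsert.
    """
    last_id = {}
    reinserts = {}
    for snap in snapshots:
        for node_id, doc_id in snap.items():
            if node_id in last_id and last_id[node_id] != doc_id:
                reinserts[node_id] = reinserts.get(node_id, 0) + 1
            last_id[node_id] = doc_id
    return reinserts


def poll_nodes(uri="mongodb://localhost:27017", db="graylog", polls=10, delay=1.0):
    """Poll the (assumed) nodes collection and return one snapshot per poll."""
    from pymongo import MongoClient  # third-party; imported only when polling

    coll = MongoClient(uri)[db]["nodes"]
    snaps = []
    for _ in range(polls):
        # _id is returned by default alongside the projected node_id field.
        snaps.append({d["node_id"]: str(d["_id"]) for d in coll.find({}, {"node_id": 1})})
        time.sleep(delay)
    return snaps
```

Calling `count_reinserts(poll_nodes())` against the cluster's MongoDB should then show which node's entry is being replaced and how often during the polling window.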
Aside from clock skew and GC pauses, both of which I have ruled out, could anything else be causing this?
Thank you.