Cannot sustain a master node - logs flooded with "Did not find meta info of this node. Re-registering"

I’m running a two-node Graylog cluster and recently upgraded from 3.3.8 to 4.0.1. Since the upgrade, the cluster has been unable to maintain a stable node status.

The System/Nodes view is inconsistent, showing one node, both, or neither at any given moment. Nodes come and go very quickly, yet they keep running and ingest logs normally. Reducing the cluster to a single node makes the problem go away, but I need the redundancy.

Once the second node is started, the first node is flooded with these log messages, repeating every second:

```
2021-01-14 07:16:23,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:24,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:25,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:26,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:27,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:32,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:37,066 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:38,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:39,069 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:40,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:41,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:42,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:43,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:44,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:45,072 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:46,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:47,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:48,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:49,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
2021-01-14 07:16:50,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
```

I have found discussions pointing to clock skew between servers. I’ve checked that, and all hosts are NTP-synchronized.

Watching the MongoDB collection in real time, I can see the node’s record being deleted and reinserted every second.

Aside from clock skew and GC pauses, both of which I’ve ruled out, what else could be causing this?

Thank you.

Hi there,

Offhand, I’m not sure what would cause this. Do you have any sort of monitoring for these nodes (e.g., resource, network, etc.)? Are you seeing anything in terms of resource utilization that seems out of whack?

Thank you, aaronsachs. I was able to find the root cause of this.

It turns out my config had stale_master_timeout=30, and I hadn’t realized this value is measured in milliseconds. I assumed it meant 30 seconds; it should have been 30000.

Even with clocks synced, 30 ms is far too short for the nodes to keep up.
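
For reference, here’s the corrected setting as it should appear in the Graylog server config file (the value is in milliseconds):

```
stale_master_timeout = 30000
```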

As a note to anyone else having issues with node status updates: the code that performs the update sends an UPDATE query to MongoDB with a filter condition based on the current timestamp offset by stale_master_timeout.

That means the record update will not happen if the record was refreshed too recently; it only gets replaced once it is older than stale_master_timeout. So that’s what the value means for Graylog internally (see the sketch below).
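
To illustrate, here’s a minimal sketch of that mechanism using the MongoDB Java driver. The collection and field names (`nodes`, `node_id`, `last_seen`) and the loop structure are my assumptions for the example, not Graylog’s actual code; the point is only to show why a stale timeout shorter than the ping interval forces a re-register on every cycle.

```java
import java.util.Date;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Updates;
import com.mongodb.client.result.UpdateResult;
import org.bson.Document;

public class NodePingSketch {
    public static void main(String[] args) throws InterruptedException {
        // Hypothetical values for illustration only.
        final long staleTimeoutMs = 30;    // the misread setting; should be 30000
        final long pingIntervalMs = 1000;  // the ping thread fires roughly once a second
        final String nodeId = "node-1";

        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> nodes =
                    client.getDatabase("graylog").getCollection("nodes");

            while (true) {
                // Cleanup pass: any record whose last_seen is older than
                // (now - staleTimeoutMs) is considered stale and removed.
                Date cutoff = new Date(System.currentTimeMillis() - staleTimeoutMs);
                nodes.deleteMany(Filters.lt("last_seen", cutoff));

                // Ping pass: refresh our own record's last_seen.
                UpdateResult result = nodes.updateOne(
                        Filters.eq("node_id", nodeId),
                        Updates.set("last_seen", new Date()));

                if (result.getMatchedCount() == 0) {
                    // The cleanup pass already deleted our record. With a
                    // 30 ms timeout and a ~1000 ms ping interval, this branch
                    // is taken on every cycle, which is exactly the
                    // "Did not find meta info of this node. Re-registering."
                    // flood in the logs above.
                    nodes.insertOne(new Document("node_id", nodeId)
                            .append("last_seen", new Date()));
                }

                Thread.sleep(pingIntervalMs);
            }
        }
    }
}
```

With the value set to 30000, a record refreshed every second is never older than the cutoff, so it survives the cleanup pass and the node stays registered.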
