Cannot sustain a master node - logs flooded with "Did not find meta info of this node. Re-registering"

juliohm1978 · January 21, 2021, 9:34pm

I’m running a Graylog cluster with 2 nodes, and recently upgraded from 3.3.8 to 4.0.1. After the upgrade, the cluster is unable to sustain the node status.

The System/Nodes view is inconsistent: at any given moment it shows one node, both, or neither. Nodes appear and disappear very quickly, even though both are running and ingesting logs normally. Reducing the cluster to a single node makes the problem go away, but I need the redundancy.

Once the second node is started, the first node's log is flooded with this message, repeated every second:

 2021-01-14 07:16:23,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:24,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:25,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:26,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:27,502 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:32,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:37,066 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:38,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:39,069 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:40,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:41,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:42,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:43,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:44,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:45,072 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:46,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:47,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:48,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:49,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}
 2021-01-14 07:16:50,067 WARN    [NodePingThread] - Did not find meta info of this node. Re-registering. - {}

I have found discussions that point to clock skew between servers. I checked that, and all hosts are NTP-synchronized.

I can watch the MongoDB collection in real time, and the node record keeps being deleted and reinserted every second.

Aside from clock skew and GC pauses, both of which I ruled out, what else could be causing this?

Thank you.

aaronsachs · January 22, 2021, 3:26pm

Hi there,

Offhand, I’m not sure what would cause this. Do you have any sort of monitoring for these nodes (e.g., resource, network, etc.)? Are you seeing anything in terms of resource utilization that seems out of whack?

juliohm1978 · January 30, 2021, 4:43pm

Thank you, aaronsachs. I was able to find the root cause of this.

It turns out my config had stale_master_timeout=30. I had not realized this value is measured in milliseconds: I assumed it meant 30 s, but for 30 seconds it should have been 30000.

Even with clocks synced, 30ms is way too short for the nodes to keep up.
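For anyone who wants to catch this kind of unit mix-up before it reaches production, here is a rough sketch of a config check. It is only an illustration: the helper names and the 1000 ms lower bound are my own assumptions (the bound is picked because the log above shows roughly one ping per second), not anything Graylog ships.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_stale_master_timeout(text, min_ms=1000):
    """Flag a stale_master_timeout that looks like it was written in seconds.

    min_ms is an assumed lower bound, not an official Graylog limit.
    Returns None if the setting is absent, "ok" if it looks sane,
    or a warning string otherwise.
    """
    raw = parse_properties(text).get("stale_master_timeout")
    if raw is None:
        return None
    value_ms = int(raw)
    if value_ms < min_ms:
        return (f"stale_master_timeout={value_ms} ms looks too small; "
                f"did you mean {value_ms * 1000}?")
    return "ok"


print(check_stale_master_timeout("stale_master_timeout=30"))
print(check_stale_master_timeout("stale_master_timeout = 30000"))
```

The first call reports the suspicious value and suggests 30000; the second returns "ok".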

As a reminder to anyone having issues with node status updates: the code that performs the update sends an UPDATE query to MongoDB with a filter condition based on the current timestamp offset by stale_master_timeout.

That means the record is not updated if it was already updated too recently, and it is replaced if it is older than stale_master_timeout. So that is what the value means to Graylog internally.
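To make the effect of the timeout concrete, here is a simplified model of the heartbeat. This is not Graylog's actual code; it just assumes a node pings about once per second (as the log timestamps above suggest) and treats its own record as missing whenever the last heartbeat is older than the timeout. The function and parameter names are made up for illustration.

```python
def simulate(stale_timeout_ms, ping_interval_ms=1000, pings=5):
    """Count how often a node finds its own record stale and re-registers.

    Simplified model, not Graylog's implementation: a record counts as
    alive only if it was refreshed within stale_timeout_ms.
    """
    last_seen_ms = 0      # time of the node's previous heartbeat
    reregistrations = 0
    for i in range(1, pings + 1):
        now_ms = i * ping_interval_ms
        if now_ms - last_seen_ms > stale_timeout_ms:
            # Analogous to "Did not find meta info of this node. Re-registering."
            reregistrations += 1
        last_seen_ms = now_ms  # the heartbeat (or re-registration) refreshes the record
    return reregistrations


print(simulate(30))     # timeout of 30 ms: prints 5, every ping re-registers
print(simulate(30000))  # timeout of 30 s: prints 0
```

With a 30 ms timeout the record is always older than the threshold by the time the next ping comes around, which matches the once-per-second warnings in the log.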

system · February 13, 2021, 4:44pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
