Notification condition [NO_MASTER] has been fixed

Hello,

We are receiving this message every minute.
We have already checked NTP and GC; both look OK.

[ ~]# jhsdb jmap --heap --pid 99383
Attaching to process ID 99383, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 11.0.6+10-LTS

using thread-local object allocation.
Garbage-First (G1) GC with 23 thread(s)

Heap Configuration:
   MinHeapFreeRatio         = 40
   MaxHeapFreeRatio         = 70
   MaxHeapSize              = 17179869184 (16384.0MB)
   NewSize                  = 1363144 (1.2999954223632812MB)
   MaxNewSize               = 8589934592 (8192.0MB)
   OldSize                  = 5452592 (5.1999969482421875MB)
   NewRatio                 = 1
   SurvivorRatio            = 8
   MetaspaceSize            = 21807104 (20.796875MB)
   CompressedClassSpaceSize = 1073741824 (1024.0MB)
   MaxMetaspaceSize         = 17592186044415 MB
   G1HeapRegionSize         = 8388608 (8.0MB)

Heap Usage:
G1 Heap:
   regions  = 2048
   capacity = 17179869184 (16384.0MB)
   used     = 5469285728 (5215.917327880859MB)
   free     = 11710583456 (11168.08267211914MB)
   31.835432909429073% used
G1 Young Generation:
Eden Space:
   regions  = 405
   capacity = 9017753600 (8600.0MB)
   used     = 3397386240 (3240.0MB)
   free     = 5620367360 (5360.0MB)
   37.674418604651166% used
Survivor Space:
   regions  = 1
   capacity = 8388608 (8.0MB)
   used     = 8388608 (8.0MB)
   free     = 0 (0.0MB)
   100.0% used
G1 Old Generation:
   regions  = 248
   capacity = 8153726976 (7776.0MB)
   used     = 2071899488 (1975.9173278808594MB)
   free     = 6081827488 (5800.082672119141MB)
   25.41045946348842% used

http_bind_address set to 0.0.0.0:9000 on all nodes (http_publish_uri is not set).

In the log there's this error about the event processor, but I don't know whether it is a consequence or a cause of the master 'failure'. The alert page does not load.

2020-04-17T15:34:00.694-03:00 ERROR [EventProcessorEngine] Caught an unhandled exception while executing event processor <aggregation-v1/Teste CDNPS/5dc031b6785c9de74c0fe418> - Make sure to modify the event processor to throw only EventProcessorExecutionException so we get more context!
com.google.inject.ProvisionException: Unable to provision, see the following errors:

1) Error injecting constructor, java.lang.RuntimeException: Couldn't get node ID
  at org.graylog.events.event.EventProcessorEventFactory.<init>(EventProcessorEventFactory.java:34)
  while locating org.graylog.events.event.EventProcessorEventFactory

The message is shown on basically all nodes, but mainly on the master.

We have the master node (not receiving data) and six data nodes with the same config:
Graylog 3.1.2
JVM Oracle Corporation 11.0.6 on Linux 3.10.0-1062.9.1.el7.x86_64
32 CPU, 32 GB (16 heap)

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret =  <edited>
root_password_sha2 = <edited>
root_timezone = America/Sao_Paulo
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 0.0.0.0:9000
web_listen_uri = http://0.0.0.0:9000/
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 7
retention_strategy = delete
elasticsearch_shards = 3
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = <edited>
elasticsearch_hosts = <edited>
elasticsearch_max_total_connections = 150
elasticsearch_analyzer = standard
output_batch_size = 1000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 32
outputbuffer_processor_threads_core_pool_size = 6
outputbuffer_processor_threads_max_pool_size = 60
inputbuffer_processors = 8
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 95gb
lb_recognition_period_seconds = 3
mongodb_uri = <edited>
mongodb_max_connections = 500
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = <edited>
transport_email_port = 25
transport_email_use_auth = false
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_subject_prefix = [graylog]
transport_email_from_email = <edited>
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32
#http_enable_tls = true
#gc_warning_threshold = 2s

What else should we look at?
Thanks in advance

Has this been working before and the issue only came up now, or is this a new install and you are trying to solve this issue as part of the initial deployment?

Without pointing a finger, keep in mind that Graylog 3.x is the first version to support Java 11, so there may be some issues to iron out there. Not blaming it... just mentioning.

Hi cawfehman,

Can't tell you when it started; it is a legacy problem inherited from other people. What I can say is that we upgraded to Java 11 to try to solve the problem, which didn't happen.

Sounds like something is intermittently interrupting comms between the Graylog nodes... Is there anything between them that could be causing connectivity issues? Not sure if there's a way to debug the heartbeat between nodes, but if you do a packet capture on the master node for all traffic between the master and slave nodes, you may see something.
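
A minimal capture sketch along those lines, assuming the default Graylog API port 9000 and MongoDB port 27017 (the interface name and the data-node IP are placeholders for your own values):

# Capture traffic between this master and one data node on the Graylog API and MongoDB ports.
# Replace eth0 and 10.0.0.12 with your interface and the data node's IP.
tcpdump -i eth0 -nn -w /tmp/graylog-master.pcap 'host 10.0.0.12 and (port 9000 or port 27017)'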

Off the top of my head, things that can randomly block traffic include:

firewalls shunning traffic
NGFW or IDS/IPS doing deep packet inspection
proxy-arp
duplicate IP address
and then there are the malicious options, too

@RCR I'm new to Graylog, but hopefully I can be of some help. Based on looking at the EventProcessorEventFactory.java code, the node ID is being retrieved by first retrieving the hostname of the node from the Graylog configuration. Unfortunately, there isn't sufficient granularity in the code to determine whether the hostname retrieval is the part that is failing or whether it's still something related to the node ID lookup using the hostname. So we have to tackle one at a time and see what we find. I don't think this is a connection issue from node to node, because there isn't any indication that communication is attempting to occur over the network. This seems to be an issue with the configuration in the database.

Can you provide the contents of your /etc/graylog/server/node-id files for review? The UUID in the file for each node should match the corresponding UUID in the Mongo database.

Additionally, the hostnames are set in Graylog when the nodes are registered.

To confirm the hostnames and UUIDs are correct you can connect to the Mongo db.

Run 'mongo' at the command shell on your master node. Then execute the following two commands to retrieve the node config information in JSON format.

use graylog
db.nodes.find()

Provide this output for review.
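
If it helps, a quick way to compare the two on each node might look like this (assuming a local, unauthenticated mongod; add your connection string and credentials as needed):

# The node's own ID, as written by Graylog
cat /etc/graylog/server/node-id
# The IDs and hostnames Graylog has registered in MongoDB
mongo graylog --quiet --eval 'db.nodes.find({}, {node_id: 1, hostname: 1, is_master: 1}).forEach(printjson)'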

Is it possible you have nodes that were registered multiple times?

Hi @bmccombs, the node-id is OK. Both IDs are the same (in Mongo and in the node-id file). That's my master, for example:

{ "_id" : ObjectId("5ea05967da56ea6327103ef4"), "is_master" : true, "hostname" : "<edited OK>", "last_seen" : 1587567010, "transport_address" : "http://<edited OK>:9000/api/", "type" : "SERVER", "node_id" : "b0a91de9-7b40-4d68-b066-1ebf8a9159d1"}

Is it possible you have nodes that were registered multiple times?

No, there's only one record for each node.

The message is shown mostly by the master, and just a few times by the other nodes. Do you think the connection between nodes could be the cause? As a test, I disabled firewalld and the problem didn't stop.

Do you have the underlying MongoDB set up as a replica set? I'm not sure that it is a network issue, but that type of message is a typical lost-heartbeat/failover kind of message.

I'm curious what path @bmccombs is heading down, but if I were troubleshooting this... I would definitely try to rule out the network.

@RCR
Have your nodes' hostnames changed since they were registered? Maybe it's worthwhile to check /etc/hosts and/or DNS records to see if the hostnames differ from what MongoDB has stored for them.
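
As a rough illustration of that check, run on each Graylog node (getent resolves the name the same way the OS does, via nsswitch):

# What the OS thinks this node is called
hostname -f
# How that name resolves (hosts file and/or DNS, per /etc/nsswitch.conf)
getent hosts $(hostname -f)

The output can then be compared against the hostname field stored for that node in MongoDB.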

Do you have these logs being sent to Graylog or some place where you can search to see when they first occurred? That can help to isolate and maybe identify the root cause. You asked initially whether these errors are a cause or an effect of the master's failure. If you can search for these logs and target the timeframe when you know the master failed, then you can see whether these errors were occurring beforehand, to answer your initial question.

@cawfehman The code indicates it's trying to look up the node ID using the hostname (using data either cached or by hitting the db directly, I can't tell which). So it's not trying to communicate with other nodes yet, based on that. Agree? The code below is from the latest revision rather than the version RCR is using.

 public EventProcessorEventFactory(ULID ulid, NodeService nodeService, NodeId nodeId) {
    this.ulid = ulid;
    try {
        // TODO: This can fail depending on when it's called. Check if we can load it in create() and cache it
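        // byNodeId() looks the node up by its ID (backed by Graylog's "nodes" collection in MongoDB)
        // and throws NodeNotFoundException if no matching record is found.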
        this.source = nodeService.byNodeId(nodeId).getHostname();
    } catch (NodeNotFoundException e) {
        throw new RuntimeException("Couldn't get node ID", e);
    }
}

Have your nodes' hostnames changed since they were registered? Maybe it's worthwhile to check /etc/hosts and/or DNS records to see if the hostnames differ from what MongoDB has stored for them.

I have checked; the hostname is correct in /etc/hosts. They are the same. I didn't check DNS, but I tried to ping and there's no problem.

Do you have these logs being sent to Graylog or some place where you can search to see when they first occurred? That can help to isolate and maybe identify the root cause. You asked initially whether these errors are a cause or an effect of the master's failure. If you can search for these logs and target the timeframe when you know the master failed, then you can see whether these errors were occurring beforehand, to answer your initial question.

We don't have it stored anywhere else, and we also can't tell when it started, because it was already happening when I started working with it, and we don't have logs older than that. :sob: That's why I am trying anything.

I'll set another node as master to see what happens. We had an issue, so I am waiting for the accumulated messages to be stored in Elasticsearch, and then I'll try it.

Ping will use the hosts file before DNS unless the /etc/nsswitch.conf file says differently, so most likely the ping didn't actually test the DNS records. I'm not sure whether formal name resolution is playing a part here or not, but it was a simple enough issue to test, so that's why I mention it.
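
To test name resolution both ways, something along these lines would show the difference (the node name is a placeholder):

# Ask the configured DNS servers directly, bypassing /etc/hosts
dig +short graylog-node-01.example.com
# Resolve the same name through nsswitch (hosts file first, then DNS, by default)
getent hosts graylog-node-01.example.com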

@bmccombs, I think I'm following you now...

I'm no Java expert... hardly a beginner, tbh... so I could be off here, but after stumbling through some of the code, it seems to me that the NodeId is passed into the EventProcessorEventFactory method, so wouldn't it be the other way around? Trying to get the hostname based on the NodeID?

If that is true, then I'm thinking three things:

  1. the error is a bit misleading, as it should say "Couldn't get hostname".
  2. what is the nodeID being passed in, and where is it coming from?
  3. could this be due to some underlying MongoDB issue? Replica set, etc...

I do agree with you now that I don't think it's a network issue. At least not yet :slight_smile:

@RCR Do you have a MongoDB replica set set up across your nodes? If so, any errors there?

Have you checked NTP and the nodes' time? It can often cause that problem.
Also check Graylog's log.
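
For completeness, a couple of commands to double-check clock sync on each node (assuming chrony on EL7; use ntpq -p instead if you run ntpd):

timedatectl status   # reports whether the system clock is synchronized
chronyc tracking     # offset, stratum, and reference source when chrony is in use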

I was looking at MongoDB too, but there's nothing in the logs, so I don't know where else to look.

Yes, we have a replica set.
At 07:39 we got the message in Graylog, and these are MongoDB's logs:

2020-04-24T07:38:24.072-0300 I NETWORK  [conn102741] end connection x.x.x.x:xxxx (502 connections now open)
2020-04-24T07:38:39.568-0300 I STORAGE  [WT OplogTruncaterThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 1ms
2020-04-24T07:38:54.072-0300 I NETWORK  [listener] connection accepted from x.x.x.x:xxxx #102742 (503 connections now open)
2020-04-24T07:38:54.072-0300 I NETWORK  [conn102742] end connection x.x.x.x:xxxx (502 connections now open)
2020-04-24T07:39:24.072-0300 I NETWORK  [listener] connection accepted from x.x.x.x:xxxx #102743 (503 connections now open)
2020-04-24T07:39:24.073-0300 I NETWORK  [conn102743] end connection x.x.x.x:xxxx (502 connections now open)
2020-04-24T07:39:31.746-0300 I STORAGE  [WT OplogTruncaterThread: local.oplog.rs] WiredTiger record store oplog truncation finished in: 2ms
2020-04-24T07:39:54.072-0300 I NETWORK  [listener] connection accepted from x.x.x.x:xxxx #102744 (503 connections now open)
2020-04-24T07:39:54.072-0300 I NETWORK  [conn102744] end connection x.x.x.x:xxxx (502 connections now open)
2020-04-24T07:40:11.267-0300 I ACCESS   [conn48611] Successfully authenticated as principal __system on local
2020-04-24T07:40:11.270-0300 I ACCESS   [conn48611] Successfully authenticated as principal __system on local
2020-04-24T07:40:11.273-0300 I ACCESS   [conn48611] Successfully authenticated as principal __system on local

And these are the Graylog master's logs:

2020-04-24T07:35:53.032-03:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2020-04-24T07:38:02.220-03:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2020-04-24T07:39:31.172-03:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2020-04-24T08:18:54.164-03:00 INFO  [IndexRangesCleanupPeriodical] Removing index range information for unavailable indices: [8acd95272406056f8beee0ce8eeee02f_testeteste_0, 464c5b1ef22aec3a75c016b5ca6b335b_testeteste_0, 823ee89f5d97dd3bc732e61265fa30a9_testeteste_0, d3dd23f23876edebb28778294dc6a922_pro_99999_sms_0, 5c915bb69de9b27d2b1e9e3ca8e2d16c_testeteste_0, 3b87da1cddaa60e2a6318e821413ecc8_testeteste_0]

Yesterday I also tried disabling firewalld on both nodes (Graylog master and MongoDB primary) and set another node as the Graylog master, but the problem persists.

Argh, yes, you are right. Apparently I was rushing too much when reviewing the code, so I misread it. It's retrieving the hostname based on the nodeId, which is passed into the method, so I agree with your assessment of that part.

So, yes, in my opinion the error should really say that it couldn't get the hostname, because the nodeID should already be known when the EventProcessorEventFactory method is called. It probably wouldn't hurt to have a check for a null nodeID, for that matter. And it would be great to have the error include the nodeID and the retrieved hostname, to know, for example, whether the hostname was an empty string or some other value, in order to do proper debugging/troubleshooting of future issues.

The best we can go on right now is the fact that the exception thrown is a NodeNotFoundException, which would indicate that the nodeID being used simply doesn't exist, and therefore the hostname is probably null or an empty string. So another theory for @RCR might be that a node was removed and not cleaned up properly, and this code is still finding a reference to it somewhere.

Maybe this could have been the result of the master's previous failure, if the master node had to be re-registered, which I assume would have generated a new UUID; but, depending on how that issue was addressed, maybe there are still some references to the old UUID being used in some code paths.

And is there any way to "reset" the replica set?

Do you think Elasticsearch could be related to this issue? I don't think so, but I am asking just in case.

It's definitely possible to reset the replica set:
https://www.google.com/search?client=firefox-b-1-d&q=mongodb+reset+replica+set
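
Before resetting anything, it is probably worth confirming from the current primary that the replica set is actually healthy (run on the primary; add credentials if authentication is enabled, and note that member names and states are specific to your deployment):

# Print each member's name, state, and health from the mongo shell
mongo --eval 'rs.status().members.forEach(function(m) { print(m.name, m.stateStr, m.health); })'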

How much you'll have to be concerned about Graylog in the process of doing that is not something I know the answer to. I'm not sure what you mean when asking whether ES can be related.

I will discuss with the team if we should try the reset, thanks!

I was just wondering if there is any communication between Graylog and Elasticsearch that could be causing this. Our cluster has 42 data nodes and 3 masters, and maybe some config could be wrong.

The fact is that the whole system is working; we are handling 15k msg/s without problems.

I can't say for certain, but I doubt ES or Graylog caused this. My theory is that it is a result of the previous failure you mentioned, but I don't have a way to prove that theory right or wrong. I don't know enough about Graylog and the underlying components to know how likely it is that these errors are a result of the master node's previous failure or a symptom of it. But hopefully the reset works.
