Multi-node cluster down after ES and Graylog update

Hello,

After updating ES and Graylog, the Cluster ID seems to have changed, so I lost my free Enterprise license and requested a new one. I stopped the graylog-server service on the first node, set is_master = true on the second node, and restarted the service. A different Cluster ID appeared for the second node…
In the System / Nodes menu only 1 node is present. Before the update there were 2 nodes (multi-node configuration). The ES cluster and mongod seem OK on each node.
Updates done:
ES : 6.8.3 => 7.10.2
Graylog : 3.3 => 4.2 (Enterprise)
Mongod : 4.0.12 => 4.0.28

  • OS Information:
Debian 9, with 2 nodes in a multi-node environment.

  • Package Version:

  • ES 7.10.2
  • Graylog Enterprise 4.2
  • Mongod 4.0.28

Why is my multi-node cluster down after this update (only one node in the “Nodes” menu)? Why did the Cluster ID change? How can I regenerate the Cluster ID so that it is the same on each node?

My graylog update (3.3 => 4.2) procedure:

=== Procedure start ===
service graylog-server stop
service elasticsearch stop
service mongod stop

wget https://packages.graylog2.org/repo/packages/graylog-4.2-repository_latest.deb
dpkg -i graylog-4.2-repository_latest.deb
apt-get update
apt-get upgrade

service elasticsearch start
=> Up and running. And ES cluster is green (=> OK).

service mongod start
=> Cluster is OK.

service graylog-server start
Checks in the web browser.
=== Procedure end ===
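As an extra check, here is a sketch of how to list the Graylog nodes registered in the cluster from the CLI (assuming the /api/system/cluster/nodes endpoint on the default API port; the credentials shown are placeholders):

# List the Graylog nodes currently registered in the cluster
curl -u admin:password -X GET "http://node1.xxx.xxx.xxx:9000/api/system/cluster/nodes"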

Regards,
Thomas

Hello,

Since it's free, I believe you can get a new one.

Did you check the log files on the node missing from the web ui?

Is the content of the node_id_file at /etc/graylog/server/node-id unique for each Graylog node?

The node ID has to be unique for each node.
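A quick way to compare them, assuming the default node_id_file path (run this on each Graylog node; the values must differ):

# Print the Graylog node ID; each node must have a unique value
cat /etc/graylog/server/node-id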

“Since it's free, I believe you can get a new one.”

I can get a new Enterprise license and I did. I requested a new license using the Cluster ID of the first node. After that, I checked the Cluster ID of the second node, but it showed:
"Installed licenses
This Graylog cluster ID is 00000000-0000-0000-0000-000000000000
There are no licenses installed in this cluster. "
I set is_master = true on this node and restarted the graylog-server service, and the Cluster ID is now different on each node… Before the update the Cluster ID was the same on each node (because it was a cluster). So the cluster seems to be broken, but why…?

“Did you check the log files on the node missing from the web ui?”

Yes, but I don't see anything explaining why the cluster is down and the Cluster ID changed :frowning:

“Is the content of the node_id_file at /etc/graylog/server/node-id unique for each Graylog node?”

Node ID is different on each node.

Regards.

Hello,
Can I ask how many nodes you have in your cluster? Do you have Graylog, MongoDB, and Elasticsearch on each node, or are they separated?

That is a true statement — you should see both nodes there.

A couple of questions I need to ask:

  • Have you configured a unique http_bind_address for each node? Meaning the IP address or FQDN?
  • Is it possible to show how you configured your cluster? It's hard to troubleshoot this issue; at this point I'm just guessing.
  • Are you using any type of load balancer?
  • Did you check each node to make sure it's the same?

Hello,

Please find below answers.
For information, I tried to do the update step by step, and it appears the cluster goes down and the Cluster ID changes after the Graylog update from 3.2.6 to 3.3.16.
With ES v6.8.23 and Mongod v4.0.28.

Here are the answers:

Can I ask how many nodes you have in your cluster? Do you have Graylog, MongoDB, and Elasticsearch on each node, or are they separated?

2 nodes in the Graylog cluster. In total I have 3 nodes:

  • 2 nodes with Graylog, Elasticsearch, and Mongod
  • 1 node with Elasticsearch and Mongod only (to avoid split-brain).
  • Have you configured a unique http_bind_address for each node? Meaning the IP address or FQDN?

Yes. For my 2 graylog nodes:
Node 1: /etc/graylog/server/server.conf:http_bind_address = node1.xxx.xxx.xxx:9000
Node2: /etc/graylog/server/server.conf:http_bind_address = node2.xxx.xxx.xxx:9000

  • Is it possible to show how you configured your cluster? It's hard to troubleshoot this issue; at this point I'm just guessing.

Yes, which configuration files do you need to check this?

  • Are you using any type of load balancer?

No.

  • Did you check each node to make sure it's the same?

Yes, the version is the same and I did not modify the configuration after the update.

Regards.

Hello @Service

Perhaps post both your Graylog configuration files; let's see if we can't find an issue there.
If your cluster is down there has to be something in your log files. Did you check all your nodes' log files and the system journal (root# journalctl -xe)?

I assume you have one master between your two Graylog servers? Is this correct?
Do you have time sync on your nodes using NTP?
Are you using FQDN or IP Address for binding?
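For the time-sync question, a quick check sketch (assuming systemd-timesyncd on Debian 9; chrony or ntpd users have their own status commands):

# Run on each node; look for "NTP synchronized: yes" and compare the UTC times
timedatectl status
date -u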

If all else fails, try to start simple (if this is a production environment you may not want to do this).
Stop the Graylog and MongoDB services on all nodes and keep Elasticsearch running. Then check the status of your Elasticsearch cluster:

curl -X GET "localhost:9200/_cluster/health?pretty=true"
or
curl -X GET "localhost:9200/_nodes/_all/process?pretty"

If everything is working and you see it is "green", then start up all your MongoDB services.
Make sure the MongoDB nodes can see/connect to each other.
To check your MongoDB cluster, do as follows.

Execute this command to open the Mongo shell:

shell# mongo

Then check the replica set status with the rs query below.
shell# rs.status()

You should see your cluster status here.
This is just an example of the command executed.

connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("73b8bc01-6bdf-488d-a2e2-f4da5348309f") }
MongoDB server version: 4.4.12
---
The server generated these startup warnings when booting:
        2022-01-27T21:41:22.467-06:00: You are running on a NUMA machine. We suggest launching mongod like this to avoid performance problems: numactl --interleave=all mongod [other options]
---
> show databases;
admin    0.000GB
config   0.000GB
graylog  0.028GB
local    0.000GB
test     0.000GB
> use graylog;
switched to db graylog
> rs.status()
{
        "ok" : 0,
        "errmsg" : "not running with --replSet",
        "code" : 76,
        "codeName" : "NoReplicationEnabled"
}
>

You can also check each MongoDB node in the shell.
Query to check the status on node-01:
shell# rs.isMaster()   // should show "ismaster" : true

If it does not say it's true, check the other MongoDB nodes.

At this point your Elasticsearch/MongoDB cluster should be working fine. If not, make sure they are working first.

This would rule those out as interfering with your cluster, and we can focus on the Graylog services.

So you have two Graylog nodes; I'm not sure how you configured these two, but one should be the master (is_master = true) and the second node should have is_master = false.

You should take care that only one Graylog node is configured to be master with the configuration setting is_master = true.

EDIT: Have you seen this section?

Note that you only need Graylog, not Elasticsearch or MongoDB, on the Server. For easy configuration, just copy the Graylog server.conf from the already running Graylog Server to this new one. Then replace the IP address or hostname on the new node in any location that is found in the configuration file. Typically this means that you replace rest_listen_uri, web_listen_uri and elasticsearch_network_host. Most importantly, set is_master to false as the Graylog cluster will not elect the master automatically.

Where they state rest_listen_uri/web_listen_uri, on the newer versions that would be http_bind_address.
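A minimal sketch of the per-node differences, using the hostnames already mentioned in this thread and assuming everything else in server.conf is identical on both nodes:

# Node-01 /etc/graylog/server/server.conf (master)
is_master = true
http_bind_address = node1.xxx.xxx.xxx:9000

# Node-02 /etc/graylog/server/server.conf (not master)
is_master = false
http_bind_address = node2.xxx.xxx.xxx:9000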

From here.

Hello,

I have now sent you server.log and server.conf in a private message. I anonymized them for security…

journalctl -xe: all is OK on both nodes:
"Unit elasticsearch.service has finished starting up.
Unit mongod.service has finished starting up.
Unit graylog-server.service has finished starting up."

Yes. I have 2 nodes: 1 master and 1 secondary (is_master = false).

The time is the same on both nodes and I use FQDNs; communication between them is OK. (The ES & Mongo clusters are OK, as you can see below.)

The ES cluster is OK:
{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 119,
  "active_shards" : 238,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Mongo cluster is OK too:
rs01:PRIMARY> rs.status()
[…]
"ok" : 1,

rs01:PRIMARY> rs.isMaster()
"ismaster" : true,

Yes:
"is_master = true" for node1 and "is_master = false" for node2.

Yes, but I haven't tested it yet because I don't understand why the Cluster ID changed on node1 after the 3.2.6 => 3.3.16 update… so I lost my license. And after the update I no longer see my node 2 in the “Nodes” menu.

I sent you anonymized log and conf files for the 2 nodes.

Regards,

Hello,
I’m confused on the following statements.

I realize that the settings were working before the upgrade, and this was a major version upgrade; I point this out because you have two conflicting statements as quoted above. Some settings and configuration have changed between versions.
I did get a chance to look over both log files, and I see where Graylog had to fix a couple of things; I also see the Graylog node IDs within the logs. I'm assuming there is nothing in the Elasticsearch or MongoDB log files that would pertain to this? The only thing I can think of is the configurations.
Since there was an upgrade across two major versions of GL & ES (stated in the first post) and some settings have changed, I'm wondering if this could be the problem. So I dug through my personal documentation on this to see what I did and compared it to what you have.

I also noticed you're using HTTPS/FQDNs, so I have some suggestions for this.

Since there have been a lot of posts, I just want to sum it up.
You have three nodes:

Node-01 GL, ES, Mongo (Master node)
Node-02 GL, ES, Mongo (Not Master node)
Node-03 ES, Mongo
  • Some suggestions on your Graylog configuration.

In the past I just copied one Graylog node's configuration file over to the second Graylog node. On the second node (node-02) I adjusted the bind IP address, and if TCP/TLS was in use I adjusted the IP address there too. That was it.

  • Graylog Configuration file suggestions.

Make sure both Graylog configurations are the same on node-01 & node-02, with a few exceptions.

Your HTTPS connection on node-01:

http_bind_address = node1.xx.xx:9000 <--Good

If you're using TLS you should have something like this:

http_publish_uri = https://node1.xx.xx:9000/   <--- Use HTTPS with FQDN.

You might want to configure it as shown above; if not, Graylog will use your bind address, which is http://.
The URL for this should be https://node1.xx.xx:9000, and if the certs are incorrect you may have other issues.

The settings shown below are for Graylog: not only do you have to configure the Elasticsearch YAML file, you must also configure the Graylog configuration file with all your Elasticsearch/MongoDB nodes.

elasticsearch_hosts = http://node1:9200  <--Yours
elasticsearch_hosts = http://10.200.6.95:9200, http://10.200.6.96:9200, http://10.200.6.97:9200 <-- Perhaps try this suggestion
mongodb_uri = mongodb://localhost/graylog,node2.xx.xx:27017/graylog?replicaSet=rs01  <--You Have this
mongodb_uri = mongodb://10.200.6.92:27017,10.200.6.93:27017,10.200.6.94:27017/graylog?replicaSet=replica01  <--Perhaps try this suggestion

Elasticsearch configuration Check/Suggestion
The ES setting discovery.zen.ping.unicast.hosts no longer exists in version 7.10; use discovery.seed_hosts as shown below.

discovery.seed_hosts: ["10.0.1.101", "10.0.1.102", "10.0.1.103"]   <--Your ES config file should look something like this.

This is a good read if you haven’t seen this already.

Note that I am using FQDNs in my Graylog environment. To sum it up: entries in your /etc/hosts file help ensure your systems know who the other devices are.
Servers should be added to local DNS and have PTR records, if you haven't done that already. Since I see you're using TLS/SSL, pointer (PTR) records (also known as reverse lookups) are a must.
This will prevent issues later when one node needs to contact another node, especially when using certificates.
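A quick sketch of how to verify forward and reverse lookups from each node (hypothetical names/addresses matching the /etc/hosts example below; host, dig, or getent all work):

# Forward lookup: hostname -> IP
host Node-001.domain.com
# Reverse (PTR) lookup: IP -> hostname
host 10.10.10.04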

Perhaps edit the /etc/hosts file like this on all three nodes (01, 02, 03):

10.10.10.04 Node-001.domain.com
10.10.10.05 Node-002.domain.com
10.10.10.06 Node-003.domain.com

If your second node does not show up in the web UI, I have a feeling that Graylog doesn't really know there is a second node; MongoDB holds all the metadata.
What I'm looking for is something to tell us what's going on; with nothing in the logs, it's kind of hard to tell why Graylog is not displaying your second node. My apologies, I haven't had this issue yet, so I'm just trying to troubleshoot it.

EDIT: I did some more research on this. I know some of these posts may not be for your version, but the issue is the same, pretty much what I explained above.

EDIT2: Out of curiosity, what do you see when you execute this on your Graylog server?

curl -X GET 'http://ES-Host:9200/_cat/nodes?v'

Hope that helps

Hello,

Thanks for your response.

Yes, at first I updated 3.1 directly to 4.2. The nodes are VMs, so I rolled back to Graylog 3.1 (when everything worked fine) and then did the update step by step. Starting with 3.1 to 3.2 => OK. After that, I did the update from 3.2 to 3.3 and then it was broken, which is when I answered you (second message in this post).

After Graylog was updated to 3.3 I changed this parameter on the 2 nodes:

  • On node 1:
    mongodb_uri = mongodb://localhost:27017,node2.xx.xx:27017,node-quorum.xx.xx:27017/graylog?replicaSet=rs01
  • On node 2:
    mongodb_uri = mongodb://localhost:27017,node1.xx.xx:27017,node-quorum.xx.xx:27017/graylog?replicaSet=rs01

I started Graylog and it's now OK!

So it was, apparently, an error in this MongoDB connection string in the Graylog configuration… An incompatibility between Graylog 3.2 and 3.3? In any case it's working now! :slight_smile:

After that, I upgraded 3.3 to 4.2 and it's OK too: the Cluster ID did not change and the cluster with 2 nodes is up and running.

So the Graylog upgrade is OK: 3.1 to 4.2 done :slight_smile:

Now I’ll try to upgrade ES from 6.8.23 to 7.10.2. Hope it will go well…

Thanks again for your research and advice!
Regards,

Hi again,

I was desperate to understand why the Cluster ID changed (and the Graylog cluster was down), so I continued my research in more depth. Here are my findings.

Before changing the mongodb_uri parameter in the Graylog configuration, it was:

  • For node 1:
    mongodb_uri = mongodb://localhost/graylog,node2.xx.xx:27017/graylog?replicaSet=rs01
  • For node 2:
    mongodb_uri = mongodb://localhost/graylog,node1.xx.xx:27017/graylog?replicaSet=rs01

In my log files during the failed upgrades I found this (about the MongoDB databases):
rs01:PRIMARY> show databases;
admin 0.000GB
config 0.000GB
graylog 0.006GB
graylog,node1 0.001GB
graylog,node2 0.002GB
local 0.481GB

So databases "graylog,node1" and "graylog,node2" were created and Graylog no longer used the single "graylog" database! Each node implicitly (?) created its own database ("graylog,node2" for node 1 and "graylog,node1" for node 2, because that is how it appeared in the Graylog configuration) => the Cluster ID changed (because it was not found in these new databases) and so the Graylog cluster was down too…
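For reference, a sketch of the standard mongodb:// connection string format (general MongoDB URI syntax, not Graylog-specific): all replica set members are listed before the first /, and the path after the last host, up to the ?, is the database name. That is consistent with the oddly named databases above.

# Correct general form: hosts first, then one database name
mongodb://host1:27017,host2:27017,host3:27017/graylog?replicaSet=rs01

# Malformed string from above: only "localhost" is parsed as a host, and the
# remaining path appears to have been treated as (part of) the database name
mongodb://localhost/graylog,node2.xx.xx:27017/graylog?replicaSet=rs01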

By changing the mongodb_uri parameter to the correct connection string (provided in your response), both Graylog nodes connect to the same database: "graylog".
As you can see here, I now have only one "graylog" database in my Mongo replica set:
rs01:PRIMARY> show databases;
admin 0.000GB
config 0.000GB
graylog 0.007GB
local 0.484GB

The Cluster ID and other Graylog metadata are in this database and so were lost after connecting to the other, newly created databases.

Thanks again for your help.
Regards,

Hello,

First of all, thanks for the feedback & added info on this issue. Situations like this interest me; it's good to know how to resolve this type of issue.

No problem, it was a learning experience :slight_smile:

Just make sure you don't go over ES 7.10; you may want to pin your package to 7.10.
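A minimal sketch of pinning on Debian, assuming the package is named elasticsearch-oss (check the actual name with dpkg -l | grep elasticsearch):

# Prevent apt-get upgrade from moving Elasticsearch past the installed 7.10.x
apt-mark hold elasticsearch-oss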

That’s weird :thinking:

No problem :smiley: There was a lot I didn't understand about your environment configuration, so I tried to point out what might be the issue. Glad it worked; when you get a chance, could you mark this post as resolved for future searches?

-Thanks

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.