Can't form cluster because of missing leader

Hi,
I’m trying to set up a Graylog cluster for testing purposes.
The MongoDB replica set and the OpenSearch cluster both seem to work just fine, but unfortunately clustering Graylog itself fails:

This message appears under System/Overview even though I have is_leader = true set on one node.
When I start only one node, there’s no warning, but the moment I bring the second and third nodes up, Graylog starts flapping under System/Nodes. In the node overview I see only one node, not three. Usually it’s the last one I restarted, so from graylog1 I see graylog3, for example (but not graylog1 itself at that moment).
The leader keeps switching: sometimes it’s graylog1, sometimes graylog2 and sometimes graylog3. There’s always exactly one node displayed in System/Nodes, never two or more at once.

My setup:
It’s a three-node setup:
graylog1 → Graylog Open 5.2.3, MongoDB 6.0.13, OpenSearch 2.11.1
graylog2 → Graylog Open 5.2.3, MongoDB 6.0.13, OpenSearch 2.11.1
graylog3 → Graylog Open 5.2.3, MongoDB 6.0.13, OpenSearch 2.11.1
All three run Alma Linux 9.3 as VirtualBox VMs. There’s no dedicated DNS server configured; name resolution for graylog[1-3].xxx.de is done via /etc/hosts.

server.conf:

is_leader = true
node_id_file = /etc/graylog/server/node-id
password_secret = XXXXXXXXXXXXXXXXXXXXXXX
root_password_sha2 = XXXXXXXXXXXXXXXXXX
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 192.168.178.129:9000
http_external_uri = http://graylog.xxx.de/
stream_aware_field_types=false
elasticsearch_hosts = http://graylog1.xxx.de:9200,http://graylog2.xxx.de:9200,http://graylog3.xxx.de:9200
disabled_retention_strategies = none
allow_leading_wildcard_searches = false
allow_highlighting = false
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
stale_leader_timeout = 10000
mongodb_uri = mongodb://graylog:graylog@graylog1.xxx.de:27017,graylog2.xxx.de:27017,graylog3.xxx.de:27017/graylog?replicaSet=rs01
mongodb_max_connections = 1000

This looks (almost) exactly the same on all three nodes, except for the is_leader = false on graylog2 and graylog3.
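
To back up the "almost exactly the same" claim, the three files can be compared key by key. The following is only a minimal sketch: it assumes the simple `key = value` layout shown above, and the two sample strings are shortened excerpts of the configuration from this post rather than the full files. The helper names `parse_conf` and `diff_settings` are made up for this example.

```python
def parse_conf(text):
    """Parse 'key = value' lines, skipping blanks and # comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, so values such as the
        # mongodb_uri (which contains '=') stay intact.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def diff_settings(confs):
    """confs maps host -> parsed settings. Returns only the keys
    whose values are not identical on every host."""
    all_keys = set().union(*(conf.keys() for conf in confs.values()))
    diffs = {}
    for key in sorted(all_keys):
        values = {host: conf.get(key) for host, conf in confs.items()}
        if len(set(values.values())) > 1:
            diffs[key] = values
    return diffs


# Shortened excerpts of the configuration posted above
graylog1_conf = """
is_leader = true
node_id_file = /etc/graylog/server/node-id
http_bind_address = 192.168.178.129:9000
mongodb_uri = mongodb://graylog:graylog@graylog1.xxx.de:27017,graylog2.xxx.de:27017,graylog3.xxx.de:27017/graylog?replicaSet=rs01
"""
graylog2_conf = """
is_leader = false
node_id_file = /etc/graylog/server/node-id
http_bind_address = 192.168.178.130:9000
mongodb_uri = mongodb://graylog:graylog@graylog1.xxx.de:27017,graylog2.xxx.de:27017,graylog3.xxx.de:27017/graylog?replicaSet=rs01
"""

diffs = diff_settings({
    "graylog1": parse_conf(graylog1_conf),
    "graylog2": parse_conf(graylog2_conf),
})
for key, values in diffs.items():
    print(key, values)
```

With the full files pasted in, anything other than the per-node settings showing up in the diff would be worth a second look.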

opensearch.yml:

cluster.name: graylog
node.name: graylog1.xxx.de
path.data: /var/lib/opensearch
path.logs: /var/log/opensearch
network.host: 192.168.178.129
http.port: 9200
discovery.seed_hosts: ["graylog2.xxx.de", "graylog3.xxx.de"]
cluster.initial_master_nodes: ["graylog1.xxx.de", "graylog2.xxx.de", "graylog3.xxx.de"]
action.auto_create_index: false
plugins.security.disabled: true
node.roles: ['data', 'master']

Could it be a problem that “plugins.security.disabled: true” comes after all the other “plugins.security.*” settings?

Output of the OpenSearch cluster health:

{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 6,
  "active_shards" : 11,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
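
The output above looks healthy (green, three nodes, nothing unassigned). For re-checking it after each restart, a small helper can flag the fields that matter here. This is only a sketch: the function name `health_problems` is made up, the thresholds are my own choice, and the sample string is a shortened copy of the output above.

```python
import json


def health_problems(health_json, expected_nodes=3):
    """Return a list of human-readable problems; an empty list
    means the checked fields look fine."""
    health = json.loads(health_json)
    problems = []
    if health.get("status") != "green":
        problems.append(f"status is {health.get('status')!r}, expected 'green'")
    if health.get("number_of_nodes") != expected_nodes:
        problems.append(
            f"{health.get('number_of_nodes')} nodes joined, expected {expected_nodes}"
        )
    if health.get("unassigned_shards", 0) > 0:
        problems.append(f"{health['unassigned_shards']} unassigned shards")
    return problems


# Shortened copy of the cluster health output above
sample = (
    '{"cluster_name": "graylog", "status": "green", '
    '"number_of_nodes": 3, "unassigned_shards": 0}'
)
print(health_problems(sample) or "no problems found")
```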

I tried:

  • Setting the http_publish_uri
  • Setting http_external_uri to the domain of the Apache server that is supposed to do the load balancing. I haven’t set Apache up yet, but that shouldn’t matter for clustering, should it?
  • Using is_master instead of is_leader
  • Adjusting the OpenSearch cluster (adding node.roles, setting cluster.initial_master_nodes)
  • Checking the time: it should be in sync via chrony
  • I can curl graylog2.xxx.de:9000 from graylog1 and graylog3, and the other way around
  • Starting one by one, starting graylog2 and graylog3 simultaneously
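
The curl check from the list can be repeated in one go from each node with a plain TCP connect test. This is a minimal sketch and only shows that something accepts connections on the port, not that the Graylog API behind it answers correctly. Port 9000 is taken from http_bind_address above; the function names are made up for this example.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and failed name resolution.
        return False


def check_peers(peers, port=9000):
    """Map each peer hostname to whether its port accepts connections."""
    return {host: tcp_reachable(host, port) for host in peers}


# Example call, to be run on graylog1:
#   check_peers(["graylog2.xxx.de", "graylog3.xxx.de"])
```

Running it on every node against the other two gives the full picture of who can reach whom.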

I can provide the log files from /var/log/graylog-server/server.log, but there are no errors and everything looks good.

I’m new to Graylog (at least to setting it up), so I’m open to any suggestions for a better configuration.
Do you have any idea how to get the cluster rolling?
The problem seems to be (as Graylog says) that the three nodes can’t elect a leader, but I don’t know why, since I have is_leader = true set.

Regards

Yeah, I think they aren’t talking to each other. What are the bind address, publish URI, and external URI set to in each of the three nodes’ server.conf files?

graylog1:

http_bind_address = 192.168.178.129:9000
http_publish_uri = http://graylog1.xxx.de/
http_external_uri = http://graylog.xxx.de/

graylog2:

http_bind_address = 192.168.178.130:9000
http_publish_uri = http://graylog2.xxx.de/
http_external_uri = http://graylog.xxx.de/

graylog3:

http_bind_address = 192.168.178.131:9000
http_publish_uri = http://graylog3.xxx.de/
http_external_uri = http://graylog.xxx.de/
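
One detail in these values may be worth a look: every http_bind_address uses port 9000, while none of the http_publish_uri values names a port. How Graylog fills in a missing port is not established anywhere in this thread, so this is not a diagnosis, but writing the port out explicitly would remove the ambiguity. A minimal sketch that flags the difference, with a made-up helper name `port_mismatches`:

```python
from urllib.parse import urlsplit


def port_mismatches(nodes):
    """nodes maps name -> (http_bind_address, http_publish_uri).
    Returns (name, bind_port, publish_port) for every node whose
    publish URI does not carry the same explicit port as the bind
    address. publish_port is None when the URI names no port."""
    findings = []
    for name, (bind, publish) in nodes.items():
        bind_port = int(bind.rsplit(":", 1)[1])
        publish_port = urlsplit(publish).port
        if publish_port != bind_port:
            findings.append((name, bind_port, publish_port))
    return findings


# Values as posted above
nodes = {
    "graylog1": ("192.168.178.129:9000", "http://graylog1.xxx.de/"),
    "graylog2": ("192.168.178.130:9000", "http://graylog2.xxx.de/"),
    "graylog3": ("192.168.178.131:9000", "http://graylog3.xxx.de/"),
}

for name, bind_port, publish_port in port_mismatches(nodes):
    shown = publish_port if publish_port is not None else "not specified"
    print(f"{name}: binds to port {bind_port}, publish URI port is {shown}")
```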

I also tried for the publish_uri:

If I go to the Web UI on (for example) graylog1, I can still see one of the other nodes at System/Nodes.

This screenshot was taken on graylog1.
The node shown there flaps while graylog-server is starting on the other nodes, but after a few seconds to a minute it seems to settle on graylog3 (because that’s the last node where I restarted graylog-server). If all three nodes are running and I reboot graylog2, that’s the one I usually see under System/Nodes. The star indicating the leader node keeps switching: sometimes I see it on the displayed node and sometimes I don’t (I assume that when I don’t see it, the leader is one of the nodes not displayed).
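
The pattern described here (only one entry at a time, usually the node restarted last) would also fit all three nodes registering under the same node ID. That is speculation: it could happen if the VirtualBox VMs were cloned after the file behind node_id_file had already been created, and nothing in this thread says they were. A minimal sketch for ruling it out, assuming the contents of /etc/graylog/server/node-id are collected from each host by hand; the IDs below are made-up placeholders, not real values.

```python
from collections import defaultdict


def find_duplicate_node_ids(node_ids):
    """node_ids maps host -> contents of its node-id file.
    Returns {node_id: [hosts]} for every ID used by more than one host."""
    hosts_by_id = defaultdict(list)
    for host, node_id in node_ids.items():
        # Ignore trailing newlines and case differences
        hosts_by_id[node_id.strip().lower()].append(host)
    return {
        node_id: sorted(hosts)
        for node_id, hosts in hosts_by_id.items()
        if len(hosts) > 1
    }


# Placeholder IDs for illustration only
sample = {
    "graylog1": "11111111-1111-1111-1111-111111111111\n",
    "graylog2": "11111111-1111-1111-1111-111111111111\n",
    "graylog3": "22222222-2222-2222-2222-222222222222\n",
}

for node_id, hosts in find_duplicate_node_ids(sample).items():
    print(f"node-id {node_id} is shared by: {', '.join(hosts)}")
```

An empty result on the real values would rule this explanation out.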
