Cluster manager not discovered exception


1. Describe your incident:

I am getting daily errors such as the following:

OpenSearchException[OpenSearch exception [type=cluster_manager_not_discovered_exception, reason=FailedToCommitClusterStateException[publication failed]; 
nested: OpenSearchException[publication cancelled before committing: timed out after 30s];]]; 
nested: OpenSearchException[OpenSearch exception [type=failed_to_commit_cluster_state_exception, reason=publication failed]]; 
nested: OpenSearchException[OpenSearch exception [type=exception, reason=publication cancelled before committing: timed out after 30s]];

This has been attached to multiple streams with multiple indices.

2. Describe your environment:

  • OS Information:
    Debian 11
  • Package Version:
    Graylog 5.2.5+7eaa89d
  • Service logs, configurations, and environment variables:

The above is the extent.

3. What steps have you already taken to try and solve the problem?

None

4. How can the community help?

Does anyone know if I should be concerned? I have another issue about which I will post shortly but, in sum, a previously functional stream no longer commits messages to the relevant index.

Both of your issues point to OpenSearch problems. How many nodes do you have in your OpenSearch cluster, and how close are they to each other (how many network hops between them)?

3 OpenSearch nodes (all three data, one also marked cluster manager). All three reside in the same subnet on the same Proxmox host. Each node has 32GB RAM (heap set to 16GB) and 16 cores.

One thing that may cause issues (which I am just now realizing) is that each node is identified in the Graylog server conf by its FQDN rather than its IP address.

Depending on how resilient the DNS is, IP may be more reliable. Do you mean you have purposely made only one manager, or that one was elected?
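
If you want to rule DNS out, a quick check is to confirm the names resolve the same way from every guest. A rough sketch (the hostnames here are taken from the elasticsearch_hosts setting shared later in this thread, so adjust to whatever your Graylog config actually uses):

# Run on the Graylog server and on each OpenSearch guest; every lookup
# should return the expected IP, with no timeouts or mismatches.
for h in graynode-0.foo.bar graynode-1.foo.bar graynode-2.foo.bar; do
  getent hosts "$h"
done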

Purposely tagged one as manager.
I.e.

graynode-0: node.roles: [cluster_manager, data]
graynode-1: node.roles: [ data ]
graynode-2: node.roles: [ data ]

I’m guessing that was wrong?

It’s generally not required to have dedicated managers until you get pretty large, and you definitely want more than one, so yes, I would give them all the role.

Cool; I’ll give that a shot.

So upon giving that a try:

Would it be better to not have any cluster_managers affirmatively declared?

It looks like an OpenSearch cluster formation issue. Can you verify that all your OpenSearch nodes can reach each other and that the configuration is correct? Do you have security enabled or disabled on OpenSearch?

It will help us give the best support if you share your OpenSearch configuration.

I noticed all the logs in your screenshot are coming from graynode-0.
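
For the reachability check, something like this from each OpenSearch guest would do it (a minimal sketch; the IPs are the node addresses shared later in this thread, and 9300 is the default transport port used for cluster formation):

# 9200 = HTTP/REST, 9300 = transport (node-to-node cluster traffic).
for ip in 192.168.128.111 192.168.128.112 192.168.128.113; do
  nc -zv "$ip" 9200
  nc -zv "$ip" 9300
done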

Thank you for the response. My configuration is as follows:
1 x Graylog Server (grayserver-0) in a Debian 11 LXC guest, on a PVE host
1 x MongoDB instance running in the above-referenced container alongside Graylog Server
3 x OpenSearch nodes, each in a Debian 12 LXC guest, on the same PVE host

All services are in the same subnet on the same PVE host.

The PVE Host:
13th Gen Intel Core i9-13900K
128GB DDR4 RAM
proxmox pve 6.5.11-8-pve

Each Opensearch Node:
16x vCPU (all p cores)
32GB RAM
Java Heap = 16GB
Swap off

Graylog Server
16x vCPU (all p cores)
16GB RAM
Java Heap = 8GB

Here are my OpenSearch configs:
Graynode-0

cluster.name: graynode-cluster
node.name: graynode-0
node.roles: [ cluster_manager, data ]
node.attr.temp: hot

path.data: /var/lib/opensearch
path.logs: /var/log/opensearch
network.host: 192.168.128.111
http.port: 9200
discovery.seed_hosts: ["192.168.128.111", "192.168.128.112", "192.168.128.113"]
cluster.initial_cluster_manager_nodes: ["graynode-0", "graynode-1", "graynode-2"]

plugins.security.ssl.transport.pemcert_filepath: graynode-0.pem
plugins.security.ssl.transport.pemkey_filepath: graynode-0-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: foo_ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: graynode-0.pem
plugins.security.ssl.http.pemkey_filepath: graynode-0-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: foo_ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
  - CN=admin,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.nodes_dn:
  - CN=graynode-0.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-1.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-2.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-config", ".plugins-ml-connector", ".plugins-ml-model-group", ".plugins-ml-model", ".plugins-ml-task", ".plugins-ml-conversation-meta", ".plugins-ml-conversation-interactions", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".ql-datasources", ".opendistro-asynchronous-search-response*", ".replication-metadata-store", ".opensearch-knn-models", ".geospatial-ip2geo-data*"]
node.max_local_storage_nodes: 3
action.auto_create_index: false
plugins.security.ssl.http.clientauth_mode: OPTIONAL

Graynode-1

cluster.name: graynode-cluster
node.name: graynode-1
node.roles: [ cluster_manager, data ]
node.attr.temp: hot

path.data: /var/lib/opensearch
path.logs: /var/log/opensearch
network.host: 192.168.128.112
http.port: 9200
discovery.seed_hosts: ["192.168.128.111", "192.168.128.112", "192.168.128.113"]
cluster.initial_cluster_manager_nodes: ["graynode-0", "graynode-1", "graynode-2"]

plugins.security.ssl.transport.pemcert_filepath: graynode-1.pem
plugins.security.ssl.transport.pemkey_filepath: graynode-1-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: foo_ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: graynode-1.pem
plugins.security.ssl.http.pemkey_filepath: graynode-1-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: foo_ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
  - CN=admin,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.nodes_dn:
  - CN=graynode-0.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-1.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-2.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-config", ".plugins-ml-connector", ".plugins-ml-model-group", ".plugins-ml-model", ".plugins-ml-task", ".plugins-ml-conversation-meta", ".plugins-ml-conversation-interactions", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".ql-datasources", ".opendistro-asynchronous-search-response*", ".replication-metadata-store", ".opensearch-knn-models", ".geospatial-ip2geo-data*"]
node.max_local_storage_nodes: 3
action.auto_create_index: false
plugins.security.ssl.http.clientauth_mode: OPTIONAL

Graynode-2

cluster.name: graynode-cluster
node.name: graynode-2
node.roles: [ cluster_manager, data ]
node.attr.temp: cold

path.data: /var/lib/opensearch
path.logs: /var/log/opensearch
network.host: 192.168.128.113
http.port: 9200
discovery.seed_hosts: ["192.168.128.111", "192.168.128.112", "192.168.128.113"]
cluster.initial_cluster_manager_nodes: ["graynode-0", "graynode-1", "graynode-2"]

plugins.security.ssl.transport.pemcert_filepath: graynode-2.pem
plugins.security.ssl.transport.pemkey_filepath: graynode-2-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: foo_ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: graynode-2.pem
plugins.security.ssl.http.pemkey_filepath: graynode-2-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: foo_ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
  - CN=admin,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.nodes_dn:
  - CN=graynode-0.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-1.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-2.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-config", ".plugins-ml-connector", ".plugins-ml-model-group", ".plugins-ml-model", ".plugins-ml-task", ".plugins-ml-conversation-meta", ".plugins-ml-conversation-interactions", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".ql-datasources", ".opendistro-asynchronous-search-response*", ".replication-metadata-store", ".opensearch-knn-models", ".geospatial-ip2geo-data*"]
node.max_local_storage_nodes: 3
action.auto_create_index: false
plugins.security.ssl.http.clientauth_mode: OPTIONAL

Ah, ok, here is probably the issue: in all 3 you have node.name: graynode-0, so they cannot form the cluster. Try changing this to node.name: graynode-1 etc. and restart, then do a curl on the cluster endpoint, e.g. https://<IP>:9200, with the user and password and see if the cluster shows all 3 nodes.
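
Concretely, something like the following (a sketch; use your admin or graylog user, and -k only because hostname verification is relaxed in these configs):

# Should list all three nodes once the cluster has formed.
curl -k -u admin 'https://192.168.128.111:9200/_cat/nodes?v'
# number_of_nodes should be 3 and status green or yellow.
curl -k -u admin 'https://192.168.128.111:9200/_cluster/health?pretty'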

Oooof. This is embarrassing, but not for the reason you think. I pasted the same config for all three and then modified it to match the corresponding config on each node. I just forgot to change Graynode-1’s and Graynode-2’s here, respectively. I have since corrected the post.

TL;DR: the error you identified only existed here, not in the actual configs. Sorry!

Can you revert to the single cluster manager and check if all nodes are in the cluster? I see nothing wrong here, but I could be missing it.

That was my configuration at setup. It worked for a time, and then the subject of this thread started occurring. So, setting multiple cluster_managers was an attempt to resolve that issue.

Do you think my storage setup could be the problem?

These containers reside on a RAIDz1

The reason I ask you to check if all nodes are in the cluster when you have a single master set is that in your first screenshot, it says your node could not discover a master node and it cannot join a cluster. Your latest ones say that a master cannot be elected, which happens when you do not have enough master nodes to make a quorum. Also, the screenshot only shows logs from a single OpenSearch node. It would be helpful to understand whether your 3 nodes are actually in a cluster; in the first state, you could ingest logs to your cluster, but in the current state, you would not, as OpenSearch is not functioning.
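
One way to tell which of those two states you are in (a sketch, same credentials and -k caveat as above) is to ask any reachable node who the elected cluster manager is:

# An empty result or a cluster_manager_not_discovered error means no manager
# has been elected; on older OpenSearch versions the endpoint is _cat/master.
curl -k -u admin 'https://192.168.128.111:9200/_cat/cluster_manager?v'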

Okay, I think I understand and maybe I am conflating terms and this is causing confusion.

There is one Graylog server (grayserver-0), which has access to three OpenSearch nodes (0, 1, 2). The screenshot shows the Graylog server. I’ve never seen the Graylog GUI refer to the individual OpenSearch nodes.

For what it is worth, here is my Graylog Server config in relevant part:

is_leader = true
node_id_file = /etc/graylog/server/node-id
password_secret = <redacted>
root_username = breakincase
root_password_sha2 = <redacted>
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin

http_bind_address = 192.168.128.114:9000
http_publish_uri = https://grayserver-0.foo.bar:9000/
http_external_uri = http://graylog.foo.world/
http_enable_cors = true
http_enable_tls = true
http_tls_cert_file = /etc/certs/grayserver-0.pem
http_tls_key_file = /etc/certs/keys/grayserver-0.key

stream_aware_field_types=false
trusted_proxies = 192.168.128.0/24,10.12.1.5/32
elasticsearch_hosts = https://graylog:<redacted>@graynode-0.foo.bar:9200, https://graylog:<redacted>@graynode-1.foo.bar:9200, https://graylog:<redacted>@graynode-2.foo.bar:9200

With respect to selecting a single manager, should I also modify the configs so that only one OpenSearch node is listed in cluster.initial_cluster_manager_nodes?

Oh! Wait, I misread the error message after you updated it; this error is actually saying there is no Graylog leader, not no OpenSearch leader. It clicked when you wrote that grayserver-0 is your Graylog server; I had OpenSearch on the brain because of the original error.

Let’s decouple this into two parts: OpenSearch and Graylog. First, with the updated OpenSearch config, is OpenSearch running correctly, and is data being ingested? Right now, your config looks correct, and in production set-ups, you almost always want 3 master nodes.
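
A quick way to confirm ingestion (a sketch; graylog_ is the default index prefix, so adjust if your index sets use a different one) is to watch the doc counts grow:

# Run twice a few minutes apart; docs.count on the active write index should increase.
curl -k -u admin 'https://192.168.128.111:9200/_cat/indices/graylog_*?v&h=index,health,docs.count,store.size'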

If OpenSearch is working correctly, you do not need to do anything here; the question becomes why Graylog is giving the NO_LEADER error on a single node when you have is_leader = true defined.
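
To see it from Graylog’s side (a sketch; the path below is the standard Graylog REST API, though the exact field name for the leader flag can differ between versions), you can ask the API which nodes it knows about and which one it considers the leader:

# Lists the Graylog nodes and their leader status; with a single node and
# is_leader = true, exactly one entry should report itself as leader.
# Add -k or --cacert if your internal CA is not in the system trust store.
curl -u breakincase 'https://grayserver-0.foo.bar:9000/api/system/cluster/nodes'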

So, decoupled: yes, OpenSearch appears to be running silky smooth (no errors in over 24 hours). *EDIT: no errors reported in the GUI, that is.

As to the NO_LEADER, that may be the true question. Let me investigate the logs…

Edit: If you are so inclined
Server Log:
https://file.io/ZDsmK5seW5rd

Password: "Graylog 5.2.5+7eaa89d " (no quotes) (yes, that’s a space at the end)

Hey @accidentaladmin

Did this resolve your issue? I’m curious because I’m setting up a cluster on my Prox server.