Cluster manager not discovered exception


1. Describe your incident:

I am getting daily errors such as the following:

OpenSearchException[OpenSearch exception [type=cluster_manager_not_discovered_exception, reason=FailedToCommitClusterStateException[publication failed]; 
nested: OpenSearchException[publication cancelled before committing: timed out after 30s];]]; 
nested: OpenSearchException[OpenSearch exception [type=failed_to_commit_cluster_state_exception, reason=publication failed]]; 
nested: OpenSearchException[OpenSearch exception [type=exception, reason=publication cancelled before committing: timed out after 30s]];

This has been attached to multiple streams with multiple indices.

2. Describe your environment:

  • OS Information:
    Debian 11
  • Package Version:
    Graylog 5.2.5+7eaa89d
  • Service logs, configurations, and environment variables:

The above is the extent.

3. What steps have you already taken to try and solve the problem?

None

4. How can the community help?

Does anyone know if I should be concerned? I have another issue about which I will post shortly but, in sum, a previously functional stream no longer commits messages to the relevant index.

Both of your issues point to OpenSearch problems. How many nodes do you have in your OpenSearch cluster, and how close are they to each other (how many network hops between them)?

3 OpenSearch nodes (all three data, one also marked cluster manager). All three reside in the same subnet on the same Proxmox host. Each node has 32GB RAM (heap set to 16GB) and 16 cores.

One thing that may cause issues (which I am just now realizing) is that each node is identified in the Graylog server conf by its FQDN rather than its IP address.

Depending on how resilient the DNS is, IP may be more reliable. Do you mean you have purposely made only one manager, or that one was elected?
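
If you want to rule DNS out, a quick check is to confirm the names resolve the same way from every guest. A rough sketch (the hostnames here are taken from the elasticsearch_hosts setting shared later in this thread, so adjust to whatever your Graylog config actually uses):

# Run on the Graylog server and on each OpenSearch guest; every lookup
# should return the expected IP, with no timeouts or mismatches.
for h in graynode-0.foo.bar graynode-1.foo.bar graynode-2.foo.bar; do
  getent hosts "$h"
done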

Purposely tagged one as manager.
I.e.

graynode-0: node.roles: [cluster_manager, data]
graynode-1: node.roles: [ data ]
graynode-2: node.roles: [ data ]

I’m guessing that was wrong?

It’s generally not required to have dedicated managers until you get pretty large, and you definitely want more than one, so yes, I would give them all the role.

Cool; I’ll give that a shot.

So upon giving that a try:

Would it be better to not have any cluster_managers affirmatively declared?

It looks like an OpenSearch cluster formation issue. Can you verify that all your OpenSearch nodes can reach each other and that the configuration is correct? Do you have security enabled or disabled on OpenSearch?

It will help us give the best support if you share your OpenSearch configuration.

I noticed all the logs in your screenshot are coming from graynode-0.
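
For the reachability check, something like this from each OpenSearch guest would do it (a minimal sketch; the IPs are the node addresses shared later in this thread, and 9300 is the default transport port used for cluster formation):

# 9200 = HTTP/REST, 9300 = transport (node-to-node cluster traffic).
for ip in 192.168.128.111 192.168.128.112 192.168.128.113; do
  nc -zv "$ip" 9200
  nc -zv "$ip" 9300
done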

Thank you for the response. My configuration is as follows:
1 x Graylog Server (grayserver-0) in a Debian 11 LXC guest, on a PVE host
1 x MongoDB instance running in the above-referenced container alongside Graylog Server
3 x OpenSearch nodes, each in a Debian 12 LXC guest, on the same PVE host

All services are in the same subnet on the same PVE host.

The PVE Host:
13th Gen Intel Core i9-13900K
128GB DDR4 RAM
proxmox pve 6.5.11-8-pve

Each Opensearch Node:
16x vCPU (all p cores)
32GB RAM
Java Heap = 16GB
Swap off

Graylog Server
16x vCPU (all p cores)
16GB RAM
Java Heap = 8GB

Here are my OpenSearch configs:
Graynode-0

cluster.name: graynode-cluster
node.name: graynode-0
node.roles: [ cluster_manager, data ]
node.attr.temp: hot

path.data: /var/lib/opensearch
path.logs: /var/log/opensearch
network.host: 192.168.128.111
http.port: 9200
discovery.seed_hosts: ["192.168.128.111", "192.168.128.112", "192.168.128.113"]
cluster.initial_cluster_manager_nodes: ["graynode-0", "graynode-1", "graynode-2"]

plugins.security.ssl.transport.pemcert_filepath: graynode-0.pem
plugins.security.ssl.transport.pemkey_filepath: graynode-0-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: foo_ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: graynode-0.pem
plugins.security.ssl.http.pemkey_filepath: graynode-0-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: foo_ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
  - CN=admin,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.nodes_dn:
  - CN=graynode-0.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-1.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-2.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-config", ".plugins-ml-connector", ".plugins-ml-model-group", ".plugins-ml-model", ".plugins-ml-task", ".plugins-ml-conversation-meta", ".plugins-ml-conversation-interactions", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".ql-datasources", ".opendistro-asynchronous-search-response*", ".replication-metadata-store", ".opensearch-knn-models", ".geospatial-ip2geo-data*"]
node.max_local_storage_nodes: 3
action.auto_create_index: false
plugins.security.ssl.http.clientauth_mode: OPTIONAL

Graynode-1

cluster.name: graynode-cluster
node.name: graynode-1
node.roles: [ cluster_manager, data ]
node.attr.temp: hot

path.data: /var/lib/opensearch
path.logs: /var/log/opensearch
network.host: 192.168.128.112
http.port: 9200
discovery.seed_hosts: ["192.168.128.111", "192.168.128.112", "192.168.128.113"]
cluster.initial_cluster_manager_nodes: ["graynode-0", "graynode-1", "graynode-2"]

plugins.security.ssl.transport.pemcert_filepath: graynode-1.pem
plugins.security.ssl.transport.pemkey_filepath: graynode-1-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: foo_ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: graynode-1.pem
plugins.security.ssl.http.pemkey_filepath: graynode-1-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: foo_ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
  - CN=admin,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.nodes_dn:
  - CN=graynode-0.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-1.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-2.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-config", ".plugins-ml-connector", ".plugins-ml-model-group", ".plugins-ml-model", ".plugins-ml-task", ".plugins-ml-conversation-meta", ".plugins-ml-conversation-interactions", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".ql-datasources", ".opendistro-asynchronous-search-response*", ".replication-metadata-store", ".opensearch-knn-models", ".geospatial-ip2geo-data*"]
node.max_local_storage_nodes: 3
action.auto_create_index: false
plugins.security.ssl.http.clientauth_mode: OPTIONAL

Graynode-2

cluster.name: graynode-cluster
node.name: graynode-2
node.roles: [ cluster_manager, data ]
node.attr.temp: cold

path.data: /var/lib/opensearch
path.logs: /var/log/opensearch
network.host: 192.168.128.113
http.port: 9200
discovery.seed_hosts: ["192.168.128.111", "192.168.128.112", "192.168.128.113"]
cluster.initial_cluster_manager_nodes: ["graynode-0", "graynode-1", "graynode-2"]

plugins.security.ssl.transport.pemcert_filepath: graynode-2.pem
plugins.security.ssl.transport.pemkey_filepath: graynode-2-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: foo_ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.pemcert_filepath: graynode-2.pem
plugins.security.ssl.http.pemkey_filepath: graynode-2-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: foo_ca.pem
plugins.security.allow_unsafe_democertificates: false
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
  - CN=admin,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.nodes_dn:
  - CN=graynode-0.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-1.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
  - CN=graynode-2.foo.local,OU=IT,O=Foo Bar LLC,L=Anytown,ST=Serenity Now,C=US
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-config", ".plugins-ml-connector", ".plugins-ml-model-group", ".plugins-ml-model", ".plugins-ml-task", ".plugins-ml-conversation-meta", ".plugins-ml-conversation-interactions", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".ql-datasources", ".opendistro-asynchronous-search-response*", ".replication-metadata-store", ".opensearch-knn-models", ".geospatial-ip2geo-data*"]
node.max_local_storage_nodes: 3
action.auto_create_index: false
plugins.security.ssl.http.clientauth_mode: OPTIONAL

Ah, ok, here is probably the issue: in all 3 you have node.name: graynode-0, so they cannot form the cluster. Try changing this to node.name: graynode-1 etc. and restart, then do a curl on the cluster endpoint, e.g. https://<IP>:9200, with the user and password and see if the cluster shows all 3 nodes.
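
Concretely, something like the following (a sketch; use your admin or graylog user, and -k only because hostname verification is relaxed in these configs):

# Should list all three nodes once the cluster has formed.
curl -k -u admin 'https://192.168.128.111:9200/_cat/nodes?v'
# number_of_nodes should be 3 and status green or yellow.
curl -k -u admin 'https://192.168.128.111:9200/_cluster/health?pretty'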

Oooof. This is embarrassing, but not for the reason you think. I pasted the same config for all three and then modified it to match the corresponding config on each node. I just forgot to change Graynode-1’s and Graynode-2’s here, respectively. I have since corrected the post.

TL;DR: the error you identified only existed here, not in the actual configs. Sorry!

Can you revert to the single cluster manager and check if all nodes are in the cluster? I see nothing wrong here, but I could be missing it.

That was my configuration at setup. It worked for a time, and then the subject of this thread started occurring. So, setting multiple cluster_managers was an attempt to resolve that issue.

Do you think my storage setup could be the problem?

These containers reside on a RAIDz1

The reason I ask you to check if all nodes are in the cluster when you have a single master set is that in your first screenshot, it says your node could not discover a master node and it cannot join a cluster. Your latest ones say that a master cannot be elected, which happens when you do not have enough master nodes to make a quorum. Also, the screenshot only shows logs from a single OpenSearch node. It would be helpful to understand whether your 3 nodes are actually in a cluster; in the first state, you could ingest logs to your cluster, but in the current state, you would not, as OpenSearch is not functioning.
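
One way to tell which of those two states you are in (a sketch, same credentials and -k caveat as above) is to ask any reachable node who the elected cluster manager is:

# An empty result or a cluster_manager_not_discovered error means no manager
# has been elected; on older OpenSearch versions the endpoint is _cat/master.
curl -k -u admin 'https://192.168.128.111:9200/_cat/cluster_manager?v'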

Okay, I think I understand and maybe I am conflating terms and this is causing confusion.

There is one Graylog server (grayserver-0), which has access to three OpenSearch nodes (0, 1, 2). The screenshot shows the Graylog server. I’ve never seen the Graylog GUI refer to the individual OpenSearch nodes.

For what it is worth, here is my Graylog Server config in relevant part:

is_leader = true
node_id_file = /etc/graylog/server/node-id
password_secret = <redacted>
root_username = breakincase
root_password_sha2 = <redacted>
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin

http_bind_address = 192.168.128.114:9000
http_publish_uri = https://grayserver-0.foo.bar:9000/
http_external_uri = http://graylog.foo.world/
http_enable_cors = true
http_enable_tls = true
http_tls_cert_file = /etc/certs/grayserver-0.pem
http_tls_key_file = /etc/certs/keys/grayserver-0.key

stream_aware_field_types=false
trusted_proxies = 192.168.128.0/24,10.12.1.5/32
elasticsearch_hosts = https://graylog:<redacted>@graynode-0.foo.bar:9200, https://graylog:<redacted>@graynode-1.foo.bar:9200, https://graylog:<redacted>@graynode-2.foo.bar:9200

With respect to selecting a single manager, should I also modify the configs so that only one OpenSearch node is listed in cluster.initial_cluster_manager_nodes?

Oh! Wait, I misread the error message after you updated it; this error is actually saying there is no Graylog leader, not no OpenSearch leader. It clicked when you wrote that grayserver-0 is your Graylog server; I had OpenSearch on the brain because of the original error.

Let’s decouple this into two parts: OpenSearch and Graylog. First, with the updated OpenSearch config, is OpenSearch running correctly, and is data being ingested? Right now, your config looks correct, and in production set-ups, you almost always want 3 master nodes.
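
A quick way to confirm ingestion (a sketch; graylog_ is the default index prefix, so adjust if your index sets use a different one) is to watch the doc counts grow:

# Run twice a few minutes apart; docs.count on the active write index should increase.
curl -k -u admin 'https://192.168.128.111:9200/_cat/indices/graylog_*?v&h=index,health,docs.count,store.size'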

If OpenSearch is working correctly, you do not need to do anything here; the question becomes why Graylog is giving the NO_LEADER error on a single node when you have is_leader = true defined.
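
To see it from Graylog’s side (a sketch; the path below is the standard Graylog REST API, though the exact field name for the leader flag can differ between versions), you can ask the API which nodes it knows about and which one it considers the leader:

# Lists the Graylog nodes and their leader status; with a single node and
# is_leader = true, exactly one entry should report itself as leader.
# Add -k or --cacert if your internal CA is not in the system trust store.
curl -u breakincase 'https://grayserver-0.foo.bar:9000/api/system/cluster/nodes'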

So, decoupled: yes, OpenSearch appears to be running silky smooth (no errors in over 24 hours). *EDIT: no errors reported in the GUI, that is.

As to the NO_LEADER, that may be the true question. Let me investigate the logs…

Edit: If you are so inclined
Server Log:
https://file.io/ZDsmK5seW5rd

Password: "Graylog 5.2.5+7eaa89d " (no quotes) (yes, that’s a space at the end)

Hey @accidentaladmin

Did this resolve your issue? I’m curious because I’m setting up a cluster on my Prox server.