Unable to run more than 1 datanode in a cluster; other datanodes show as unavailable

1. Describe your incident:
I have a standard setup of 3 Graylog servers and 3 datanodes. During preflight I can only add 1 datanode; if I start all 3 datanodes, they each create their own cluster and break.

I have now created a 3-node Graylog server cluster with 1 datanode and am trying to add the other datanodes to the current cluster; however, they are showing as unavailable.

2. Describe your environment:

  • OS Information: Ubuntu 24.04.2 LTS

  • Package Version: 6.3.1+7bd8532

  • Service logs, configurations, and environment variables:
    On unavailable data node I get the following errors:
    [2025-07-22T10:10:09,847][ERROR][o.o.s.a.BackendRegistry ] [gldatanode2-dcde2] Not yet initialized (you may need to run securityadmin)
    [2025-07-22T10:10:15,713][WARN ][i.n.c.AbstractChannelHandlerContext] [gldatanode2-dcde2] An exception 'OpenSearchSecurityException[The provided TCP channel is invalid.]; nested: DecoderException[javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)]; nested: SSLHandshakeException[Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)]; nested: BadPaddingException[Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)];' [enable DEBUG level for full stacktrace] was thrown by a user handler's exceptionCaught() method while handling the following exception:
    2025-07-22T10:10:15.728Z INFO [OpensearchProcessImpl] io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)
    [2025-07-22T10:12:59,625][WARN ][o.o.c.c.ClusterFormationFailureHelper] [gldatanode2-dcde2] cluster-manager not discovered yet: have discovered [{gldatanode2-dcde2}{lvKMsIdYTZaVhuoHmCWPfw}{xRf8DChnRx6ql2J543H72A}{gldatanode2-dcde2}{127.0.1.1:9300}{dir}{shard_indexing_pressure_enabled=true}]; discovery will continue using [127.0.0.1:9300, [::1]:9300, 10.13.18.4:9300] from hosts providers and from last-known cluster state; node term 0, last-accepted version 0 in term 0

3. What steps have you already taken to try and solve the problem?
Reinstalled all 6 servers a few times.
Disabled/enabled the firewall and opened ports manually.

4. How can the community help?

I’m unable to find any documentation regarding this error. The connections to the database and the API work fine. There’s currently no load balancer.


Hi @duxieking ,
Can you tell us more about your setup? How are you running these datanodes? How are you configuring them? Do they see each other? How are you securing them? Is there any specific or duplicated configuration for each datanode?

Thanks,
Tomas

Hi Tomas

Thank you for getting back to me.

The datanodes and Graylog servers are deployed directly: 3 Graylog servers and 3 datanodes in total.

I followed the Ubuntu Installation: Multiple Graylog Nodes guide. It doesn’t let me post the link for some reason, sorry. The installation is exactly as recommended in the guide.

All datanodes work individually but not together. All required ports are accessible from every machine in the cluster and on the network, both via DNS and IP addresses. The datanodes can access the MongoDB replica set without any issues.

I currently haven’t implemented any special security features, just what’s in the multiple Graylog node installation guide.

The configuration for datanodes is the following:

#####################################
# GRAYLOG DATANODE CONFIGURATION FILE
#####################################
#
# This is the Graylog DataNode configuration file. The file has to use ISO 8859-1/Latin-1 character encoding.
# Characters that cannot be directly represented in this encoding can be written using Unicode escapes
# as defined in Chapter 3. Lexical Structure, using the \u prefix.
# For example, \u002c.
#
# * Entries are generally expected to be a single line of the form, one of the following:
#
# propertyName=propertyValue
# propertyName:propertyValue
#
# * White space that appears between the property name and property value is ignored,
# so the following are equivalent:
#
# name=Stephen
# name = Stephen
#
# * White space at the beginning of the line is also ignored.
#
# * Lines that start with the comment characters ! or # are ignored. Blank lines are also ignored.
#
# * The property value is generally terminated by the end of the line. White space following the
# property value is not ignored, and is treated as part of the property value.
#
# * A property value can span several lines if each line is terminated by a backslash ('\') character.
# For example:
#
# targetCities=\
# Detroit,\
# Chicago,\
# Los Angeles
#
# This is equivalent to targetCities=Detroit,Chicago,Los Angeles (white space at the beginning of lines is ignored).
#
# * The characters newline, carriage return, and tab can be inserted with characters \n, \r, and \t, respectively.
#
# * The backslash character must be escaped as a double backslash. For example:
#
# path=c:\\docs\\doc1
#

# The auto-generated node ID will be stored in this file and read after restarts. It is a good idea
# to use an absolute file path here if you are starting Graylog DataNode from init scripts or similar.
node_id_file = /etc/graylog/datanode/node-id

# location of your data-node configuration files - put additional files like manually created certificates etc. here
config_location = /etc/graylog/datanode
opensearch_heap = 8g
# You MUST set a secret to secure/pepper the stored user passwords here. Use at least 64 characters.
# Generate one by using for example: pwgen -N 1 -s 96
# ATTENTION: This value must be the same on all Graylog and Datanode nodes in the cluster.
# Changing this value after installation will render all user sessions and encrypted values in the database invalid. (e.g. encrypted access tokens)
password_secret = [redacted]

# The default root user is named 'admin'
#root_username = admin

# You MUST specify a hash password for the root user (which you only need to initially set up the
# system and in case you lose connectivity to your authentication backend)
# This password cannot be changed using the API or via the web interface. If you need to change it,
# modify it in this file.
# Create one by using for example: echo -n yourpassword | sha256sum
# and put the resulting hash value into the following line
root_password_sha2 =

# connection to MongoDB, shared with the Graylog server
# See Connection Strings - Database Manual - MongoDB Docs for details
mongodb_uri = mongodb://10.13.18.1:27017,10.13.18.2:27017,10.13.18.3:27017/graylog?replicaSet=glset0

#### HTTP bind address
#
# The network interface used by the Graylog DataNode to bind all services.
#
bind_address = 0.0.0.0

#### Hostname
#
# if you need to specify the hostname to use (because looking it up programmatically gives wrong results)
hostname = [hostname of datanode]

#### HTTP port
#
# The port where the DataNode REST api is listening
#
# datanode_http_port = 8999

#### HTTP publish URI
#
# This configuration should be used if you want to connect to this Graylog DataNode’s REST API and it is available on
# another network interface than $http_bind_address,
# for example if the machine has multiple network interfaces or is behind a NAT gateway.
# http_publish_uri =

#### OpenSearch HTTP port
#
# The port where OpenSearch HTTP is listening on
#
# opensearch_http_port = 9200

#### OpenSearch transport port
#
# The port where OpenSearch transports is listening on
#
# opensearch_transport_port = 9300

#### OpenSearch node name config option
#
# use this, if your node name should be different from the hostname that’s found by programmatically looking it up
#
node_name = [name of the node]

#### OpenSearch discovery_seed_hosts config option
#
# if you’re not using the automatic data node setup and want to create a cluster, you have to setup the discovery seed hosts
#
# opensearch_discovery_seed_hosts =

#### OpenSearch initial_manager_nodes config option
#
# if you’re not using the automatic data node setup and want to create a cluster, you have to setup the initial manager nodes
# make sure to remove this setting after the cluster has formed
#
#initial_cluster_manager_nodes = gldatanode1-dcde2
node_roles = data,ingest,remote_cluster_client
#### OpenSearch folders
#
# set these if you need OpenSearch to be located in a special place or want to include an existing version
#
# Root directory of the used opensearch distribution
opensearch_location = /usr/share/graylog-datanode/dist

opensearch_config_location = /var/lib/graylog-datanode/opensearch/config
opensearch_data_location = /var/lib/graylog-datanode/opensearch/data
opensearch_logs_location = /var/log/graylog-datanode/opensearch

#### OpenSearch Certificate bundles for transport and http layer security
#
# if you’re not using the automatic data node setup, you can manually configure your SSL certificates
# transport_certificate = datanode-transport-certificates.p12
# transport_certificate_password = password
# http_certificate = datanode-http-certificates.p12
# http_certificate_password = password

#### OpenSearch log buffers size
#
# the number of lines from stderr and stdout of the OpenSearch process that are buffered inside the DataNode for logging etc.
#
# process_logs_buffer_size = 500

#### OpenSearch JWT token usage
#
# communication between Graylog and OpenSearch is secured by JWT. These are the defaults used for the token usage
# adjust them, if you have special needs.
#
# indexer_jwt_auth_token_caching_duration = 60s
# indexer_jwt_auth_token_expiration_duration = 180s

There are no hostname / ip / mac address duplicates in configuration or in DNS entries.

The guide I followed: Ubuntu Installation: Multiple Graylog Nodes

That’s indeed strange. Your configuration looks OK, and the error message doesn’t say anything specific that would help us.

I’d remove the hostname setting from all configuration files and use only node_name, set to the DNS name of the machine.

If you then start again from scratch, do you see all datanodes in the preflight? And do they successfully obtain certificates?

I see that you configure roles for the nodes; could you initially comment this setting out, so every node can be a manager? I’d also comment out the initial_cluster_manager_nodes property, as the nodes can self-organize and detect the initial manager nodes.
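For reference, a sketch of what manual cluster formation in datanode.conf could look like if the automatic preflight setup were bypassed entirely — the hostnames below are illustrative placeholders, and these settings already appear (commented out) in the configuration posted above:

```
# Sketch only – hostnames are placeholders for your three datanodes.
opensearch_discovery_seed_hosts = gldatanode1-dcde2:9300,gldatanode2-dcde2:9300,gldatanode3-dcde2:9300

# Leave node_roles unset so every node may act as a cluster manager:
#node_roles = data,ingest,remote_cluster_client

# Only needed for the very first bootstrap; remove once the cluster has formed:
#initial_cluster_manager_nodes = gldatanode1-dcde2
```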

hostname and node_name were originally commented out. I only added them in the most recent config to test whether it would work, but no luck.

I added initial_cluster_manager_nodes during troubleshooting to see if the node would join a specific cluster.

I’ve rebuilt the cluster twice now and the same issue happens. In preflight I can see all 3 nodes; however, they go from yellow to red, and all 3 produce the same logs. If I only use 1 datanode, it works perfectly fine and goes green.

It seems like I can only use 1 datanode, and adding any others just fails. They do obtain the cert but just don’t work as one cluster.

In preflight I can also do the following:

1. Start 1 datanode and complete the certificate step.
2. Turn off the datanode with the cert, turn on another one, and it goes green.

I can get all 3 of them green individually, and if I then turn them all on they show green together; however, they create separate clusters and cause issues, as you can imagine.

So for some reason there’s an issue with them joining one cluster together. They can only work individually :confused:

My best guess is some kind of networking issue between the OpenSearch nodes. It could also be related to the certificates and their configuration, which are generated during the preflight.

Could you, just to be 100% sure, also try ports 9200 and 9300? `nc -zv hostA 9200` and `nc -zv hostA 9300`
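The `nc` checks above can also be scripted across all nodes and ports at once. A small Python sketch doing the same TCP connect test — the hostnames in the example loop are placeholders for your actual datanodes:

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, timed out, and DNS failures
        return False


if __name__ == "__main__":
    # Placeholder hostnames – replace with your actual datanode names.
    for host in ("gldatanode1-dcde2", "gldatanode2-dcde2", "gldatanode3-dcde2"):
        for port in (9200, 9300):  # OpenSearch HTTP and transport ports
            status = "open" if port_open(host, port) else "CLOSED"
            print(f"{host}:{port} -> {status}")
```

Running this from each node in turn makes asymmetric firewall rules visible (host A reaching B while B cannot reach A), which plain one-off `nc` runs can easily miss.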

With your current one-node setup, can you generate and upload a support bundle, so we can examine the CA and cert? Cluster Support Bundle

Thanks!

I’m sure that they can talk to each other; I confirmed it again, and the connection works between the nodes.

I can also access the nodes through a browser on a Windows machine using https://hostname:9200. The functioning datanode (gldatanode1-dcde2) shows “Authentication finally failed”.
The unavailable datanode shows: “OpenSearch Security not initialized.” Maybe that’s a clue, I’m not sure?

Can you let me know how to send the support bundle? I don’t want to attach it to the post due to possibly sensitive information.

Thanks for confirmation. If the bundle is relatively small, feel free to send it directly to tomas.dvorak@graylog.com. Otherwise you could upload it somewhere password protected and send me the password by mail, if that’s ok.

Thanks for the logs. Let’s continue the discussion here, so my colleagues and other users can follow the debugging. I have asked my colleague @matthias_gl to assist us, especially because I am leaving on vacation this evening.

First of all, the mongodb connection seems to cause some repeated problems:

com.mongodb.MongoNodeIsRecoveringException: Command failed with error 91 (ShutdownInProgress): 'The server is in quiesce mode and will shut down' on server 10.13.18.1:27017. The full response is {"topologyVersion": {"processId": {"$oid": "6878f417ace38854fc535e27"}, "counter": 5}, "ok": 0.0, "errmsg": "The server is in quiesce mode and will shut down", "code": 91, "codeName": "ShutdownInProgress", "remainingQuiesceTimeMillis": 4970, "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1753180464, "i": 11}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}}, "operationTime": {"$timestamp": {"t": 1753180464, "i": 11}}}
	at com.mongodb.internal.connection.ProtocolHelper.createSpecialException(ProtocolHelper.java:264) ~[graylog.jar:?]
	at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:206) ~[graylog.jar:?]

It could be unrelated, but it can also be a sign of something wrong with the setup.

Then one of the datanodes has some issues with its keystore:

2025-07-17T12:55:24.205Z ERROR [NodePingPeriodical] Uncaught exception in Periodical
java.lang.RuntimeException: org.graylog.datanode.configuration.DatanodeKeystoreException: java.io.IOException: Tag number over 30 is not supported
	at org.graylog.datanode.configuration.DatanodeKeystore.getCertificateExpiration(DatanodeKeystore.java:192) ~[graylog-datanode.jar:?]
	at org.graylog.datanode.periodicals.NodePingPeriodical.doRun(NodePingPeriodical.java:146) ~[graylog-datanode.jar:?]
	at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:99) [graylog2-server-6.3.1.jar:?]
	at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedPeriodicRunnable.run(InstrumentedScheduledExecutorService.java:264) [metrics-core-4.2.30.jar:4.2.30]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
	at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.base/java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.graylog.datanode.configuration.DatanodeKeystoreException: java.io.IOException: Tag number over 30 is not supported
	at org.graylog.datanode.configuration.DatanodeKeystore.loadKeystore(DatanodeKeystore.java:162) ~[graylog-datanode.jar:?]
	at org.graylog.datanode.configuration.DatanodeKeystore.getCertificateExpiration(DatanodeKeystore.java:184) ~[graylog-datanode.jar:?]
	... 9 more
Caused by: java.io.IOException: Tag number over 30 is not supported
	at java.base/sun.security.util.DerValue.<init>(Unknown Source) ~[?:?]
	at java.base/sun.security.util.DerValue.<init>(Unknown Source) ~[?:?]
	at java.base/sun.security.pkcs12.PKCS12KeyStore.engineLoad(Unknown Source) ~[?:?]
	at java.base/sun.security.util.KeyStoreDelegator.engineLoad(Unknown Source) ~[?:?]
	at java.base/java.security.KeyStore.load(Unknown Source) ~[?:?]
	at org.graylog.datanode.configuration.DatanodeKeystore.loadKeystore(DatanodeKeystore.java:159) ~[graylog-datanode.jar:?]
	at org.graylog.datanode.configuration.DatanodeKeystore.getCertificateExpiration(DatanodeKeystore.java:184) ~[graylog-datanode.jar:?]

You can try to investigate whether the keystore is readable and valid. To do that, you can use the standard JDK keytool command. The keystore is encrypted with the password_secret value you have configured in datanode.conf.

You could also remove the keystore file completely and restart the datanode. It should be regenerated and populated with fresh certificates, if you have automatic cert renewal configured during the preflight.

Then, and I think here we are approaching the problem really closely, there is this error:

2025-07-25T12:02:51.099Z INFO  [OpensearchProcessImpl] [2025-07-25T12:02:51,098][WARN ][o.o.d.HandshakingTransportAddressConnector] [gldatanode2-dcde2] [connectToRemoteMasterNode[10.13.18.4:9300]] completed handshake with [{gldatanode1-dcde2}{x6bC4DaoThexkLMnRvDWuA}{7bD_4TqiRKaQscp5avV6RQ}{gldatanode1-dcde2}{127.0.1.1:9300}{dimr}{shard_indexing_pressure_enabled=true}] but followup connection failed
2025-07-25T12:02:51.099Z INFO  [OpensearchProcessImpl] org.opensearch.transport.ConnectTransportException: [gldatanode1-dcde2][127.0.1.1:9300] connect_exception
2025-07-25T12:02:51.099Z INFO  [OpensearchProcessImpl] 	at org.opensearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1110) ~[opensearch-2.15.0.jar:2.15.0]

It seems that your other datanode thinks the first one is available at [gldatanode1-dcde2][127.0.1.1:9300], as if the publish address were configured wrongly somewhere, using the localhost IP. Could you dump the content of the graylog.datanodes collection in MongoDB, so we can see what’s recorded there and which addresses your datanodes are providing?
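A quick way to see whether a node name is being steered to the loopback range — Debian/Ubuntu installs commonly map the machine’s own hostname to 127.0.1.1 in /etc/hosts, which would explain exactly the address seen above — is to check what the name resolves to locally. A small sketch (the hostname in the example is a placeholder):

```python
import ipaddress
import socket


def resolves_to_loopback(hostname: str) -> bool:
    """Return True if the hostname resolves to a loopback address
    (e.g. 127.0.1.1 from a Debian/Ubuntu-style /etc/hosts entry)."""
    ip = socket.gethostbyname(hostname)
    return ipaddress.ip_address(ip).is_loopback


# Example: run on each datanode with its own name. If this prints True,
# OpenSearch may publish the loopback address to the rest of the cluster.
# print(resolves_to_loopback("gldatanode2-dcde2"))
```

Crucially, this must be run on the node itself, since /etc/hosts overrides DNS only locally — remote machines resolving the same name via DNS would see the correct address, which is why pings between nodes can look fine while the published address is still wrong.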

Please see the screenshot of the datanodes table.

I’m really not sure why it’s trying to access it on localhost… nowhere in my config does it mention that, and in the database it’s using a hostname.

Just to make extra sure: your /etc/hosts on gldatanode2 really has

127.0.1.1 gldatanode2-dcde2

and not

127.0.1.1 gldatanode1-dcde2

?

Does a `ping gldatanode1-dcde2` on node2 resolve the correct IP?

The hosts file is correct:

The ping shows the correct address.

I don’t think this is a network issue, as all the nodes talk to each other without any problems, and there are no DNS issues.

I think it’s more of a cert/config/bug problem.

Could you please restart the UNAVAILABLE data node and post/send the startup logs? Unfortunately these were not visible in the previous logs.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.