Graylog Data Node Cluster Cert Errors

We converted a two-node, manually configured OpenSearch cluster to Graylog Data Nodes this week, but now can’t get more than one node up in the cluster. The other node failed to join the cluster with numerous certificate errors, so we created a brand-new Data Node server from a fresh install, and it fails to join with the same certificate errors. We are left with only one of three Data Nodes up and running.

  • OS Information: Ubuntu Server 22.04 LTS
  • Package Version: Graylog 6.1.5
  • Service logs, configurations, and environment variables:

An exception 'OpenSearchSecurityException[The provided TCP channel is invalid.]; nested: DecoderException[javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)]

Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown

An exception 'OpenSearchSecurityException[The provided TCP channel is invalid.]; nested: DecoderException[javax.net.ssl.SSLHandshakeException: No subject alternative DNS name matching syslog-opensearch-01 found


We tried re-issuing certificates to the affected node and adding a completely new Data Node server. We also tried to set plugins.security.ssl.transport.enforce_hostname_verification = false, but could not figure out how to pass OpenSearch parameters through to the automatically generated opensearch.yml.

Is there a known issue with Data Nodes and not supporting multiple nodes in a cluster, or is there additional configuration required to enable multi-node clusters of Data Nodes?


Hello @sysadm1,

Generally your setup should work fine with 2 or more nodes. That’s a supported and recommended setup.

From your stack trace:

Can you tell me what your setup looks like? Do you have several machines with unique host names? How do you resolve hostnames? Is this some kind of containerized setup?

Data Node will forward any env property with the opensearch. prefix to the underlying OpenSearch (by adding the property to the generated opensearch.yml). So you can set

opensearch.plugins.security.ssl.transport.enforce_hostname_verification = false

as an env property to achieve that. But that seems like a workaround for a problem with hostname resolution or with the way your nodes are trying to connect to each other. I’d recommend fixing that issue first; the rest will then work without additional configuration.
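To make the forwarding rule concrete, here is a minimal Python sketch of the idea. It only illustrates the behaviour described above and is not Data Node's actual code; the function names and the flat "key: value" rendering are assumptions made for the example.

```python
from typing import Dict, List

PREFIX = "opensearch."


def forwarded_settings(props: Dict[str, str]) -> Dict[str, str]:
    """Keep only properties that carry the opensearch. prefix and strip it."""
    return {
        key[len(PREFIX):]: value
        for key, value in props.items()
        if key.startswith(PREFIX)
    }


def render_yaml_lines(settings: Dict[str, str]) -> List[str]:
    """Render the forwarded settings as flat 'key: value' lines."""
    return [f"{key}: {value}" for key, value in sorted(settings.items())]


props = {
    "opensearch.plugins.security.ssl.transport.enforce_hostname_verification": "false",
    "node_name": "syslog-opensearch-03",  # no prefix, so it is not forwarded
}
lines = render_yaml_lines(forwarded_settings(props))
```

In this sketch, lines ends up holding a single entry for plugins.security.ssl.transport.enforce_hostname_verification, which is the setting that would land in the generated opensearch.yml.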

Best regards,
Tomas


We have a single-node Graylog server, and then three OpenSearch cluster nodes that are all separate virtual machines running Ubuntu 22.04 LTS. DNS resolution works fine between all the servers. I think that error is related to the DNS subject alternative name in the certificate, not to DNS resolution of the hostnames. This is a more complete log snippet that shows these TLS handshake errors between Data Nodes:

[WARN ][i.n.c.AbstractChannelHandlerContext] [syslog-opensearch-03] An exception 'OpenSearchSecurityException[The provided TCP channel is invalid.]; nested: DecoderException[javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)]; nested: SSLHandshakeException[Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)]; nested: BadPaddingException[Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)];'

[WARN ][i.n.c.AbstractChannelHandlerContext] [syslog-opensearch-03] An exception 'OpenSearchSecurityException[The provided TCP channel is invalid.]; nested: DecoderException[javax.net.ssl.SSLHandshakeException: No subject alternative DNS name matching syslog-opensearch-01 found.]; nested: SSLHandshakeException[No subject alternative DNS name matching syslog-opensearch-01 found.]; nested: CertificateException[No subject alternative DNS name matching syslog-opensearch-01 found.];'

As a test, we reverted these nodes to a manually configured OpenSearch cluster instead of Data Nodes, and they are working fine now. It appears that the problem is introduced by the TLS certs or security settings that the Data Node provisions.
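The "No subject alternative DNS name matching syslog-opensearch-01 found" warning means that the name the connecting node used for its peer is not among the DNS entries in the peer's certificate. A minimal Python sketch of that comparison follows. It is a simplified illustration of hostname verification (exact match plus a single left-most wildcard label), not the code OpenSearch or the JDK runs, and the SAN lists in the usage lines are hypothetical.

```python
from typing import Iterable


def san_matches(hostname: str, dns_sans: Iterable[str]) -> bool:
    """Return True if hostname matches any DNS SAN entry (case-insensitive)."""
    host = hostname.rstrip(".").lower()
    for san in dns_sans:
        pattern = san.rstrip(".").lower()
        if pattern == host:
            return True
        # Simplified wildcard rule: "*." may stand for exactly one label.
        if pattern.startswith("*."):
            suffix = pattern[1:]  # e.g. ".example.com"
            if host.endswith(suffix):
                first_label = host[: -len(suffix)]
                if first_label and "." not in first_label:
                    return True
    return False


# Hypothetical SAN lists, for illustration only.
short_vs_other = san_matches("syslog-opensearch-01", ["syslog-opensearch-03"])
short_vs_fqdn = san_matches(
    "syslog-opensearch-01", ["syslog-opensearch-01.example.com"]
)
```

Both usage lines evaluate to False: a short name does not match a different host's name, and it also does not match its own FQDN. The certificate has to list the exact name the other nodes connect with.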

Data Nodes get their SAN by looking up the hostname related to the bind_address configuration (by default 0.0.0.0), unless it is overridden by the hostname setting in datanode.conf.

So if you believe that the alternative name in the certificate is wrong, it makes sense to look at the above and try to understand where the name is coming from and why it’s not correct.
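One way to see which name a host reports for itself is a small Python sketch like the one below. It mirrors the rule described above (an explicit hostname wins, otherwise the name is looked up from the bind address), but it is an approximation: Data Node is a Java application and its actual lookup may return something different. The function name expected_san and its parameters are made up for this example.

```python
import socket
from typing import Callable, Optional


def expected_san(
    bind_address: str = "0.0.0.0",
    hostname: Optional[str] = None,
    resolve: Callable[[str], str] = socket.getfqdn,
) -> str:
    """Approximate the name that would end up as the certificate's DNS SAN."""
    if hostname:
        # An explicit hostname setting overrides any lookup.
        return hostname
    # socket.getfqdn treats "0.0.0.0" as the local host and returns its name.
    return resolve(bind_address)
```

Calling expected_san() on each node and comparing the result with the name the other nodes use to reach it (syslog-opensearch-01 in the log above) shows whether a short name and an FQDN are being mixed.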

Additionally, I’d check that node_name is unique for each node (if explicitly defined) and that there is no copy-pasted configuration that would lead to incorrectly assigned certificates.
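The uniqueness check can be scripted. The Python sketch below assumes datanode.conf uses simple "key = value" lines with # comments, and that the contents of each node's file have already been collected into a dictionary keyed by host; both the helper names and the sample contents are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def parse_conf(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, skipping blanks and # comments."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def duplicate_values(confs: Dict[str, str], key: str) -> Dict[str, List[str]]:
    """Map each value of `key` used on more than one host to those hosts."""
    seen: Dict[str, List[str]] = defaultdict(list)
    for host, text in confs.items():
        value = parse_conf(text).get(key)
        if value is not None:
            seen[value].append(host)
    return {v: sorted(hosts) for v, hosts in seen.items() if len(hosts) > 1}


# Hypothetical file contents, for illustration only.
confs = {
    "syslog-opensearch-01": "node_name = syslog-opensearch-01\n",
    "syslog-opensearch-02": "# copied from node 01\nnode_name = syslog-opensearch-01\n",
    "syslog-opensearch-03": "node_name = syslog-opensearch-03\n",
}
clashes = duplicate_values(confs, "node_name")
```

Here clashes reports that syslog-opensearch-01 and syslog-opensearch-02 share one node_name, the kind of copy-paste mistake worth ruling out. The same function can be run with "hostname" as the key.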

