Failed to add a new Data Node

1. Describe your incident:
I’ve got 3 VMs:

  1. graylog-server + mongod

  2. graylog-datanode : datanode-1

  3. graylog-datanode : datanode-2

When I connect the second node, I get this issue:

2. Describe your environment:

  • OS Information: Debian 12 + Graylog

  • graylog-datanode 6.2.1-1

  • graylog-server 6.1.10-1

  • mongodb-org 7.0.20

  • Service logs, configurations, and environment variables:
    graylog-server: server.conf

is_leader = true
node_id_file = /etc/graylog/server/node-id
password_secret = 3sCyBEmyLNNwR.........38tZ2dl
root_password_sha2 = 2cb4b1431b84ec15d35ed8........9cc4b25c8d879ae23e18
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
stream_aware_field_types=false
disabled_retention_strategies = none,close
allow_leading_wildcard_searches = false
allow_highlighting = false
field_value_suggestion_mode = on
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://localhost/graylog
mongodb_max_connections = 1000
integrations_scripts_dir = /usr/share/graylog-server/scripts
http_bind_address = 0.0.0.0:9000
message_journal_max_age = 12h
message_journal_max_size = 3gb

graylog-datanode-1 / 2: datanode.conf

node_id_file = /etc/graylog/datanode/node-id
config_location = /etc/graylog/datanode
password_secret = 3sCyBEmyLNNwRDp1WaDGU0rWKDF9uIgWRHA7Id6PmonEmC3SjkMqv1JZ8TlMHfLODLIgn7xkOfSvMsu3GJWI5y5A938tZ2dl
root_password_sha2 =
mongodb_uri = mongodb://172.28.128.150:27017/graylog
opensearch_location = /usr/share/graylog-datanode/dist
opensearch_config_location = /var/lib/graylog-datanode/opensearch/config
opensearch_data_location = /var/lib/graylog-datanode/opensearch/data
opensearch_logs_location = /var/log/graylog-datanode/opensearch

Logs from datanode-2:

14:48:05.718 [opensearch[datanode-2][transport_worker][T#1]] ERROR org.opensearch.transport.netty4.ssl.SecureNetty4Transport - Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: No subject alternative DNS name matching datanode-1 found.
javax.net.ssl.SSLHandshakeException: No subject alternative DNS name matching datanode-1 found.
	at java.base/sun.security.ssl.Alert.createSSLException(Alert.java:130) ~[?:?]
	at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:378) ~[?:?]
	at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:321) ~[?:?]
	at java.base/sun.security.ssl.TransportContext.fatal(TransportContext.java:316) ~[?:?]
	at java.base/sun.security.ssl.CertificateMessage$T13CertificateConsumer.checkServerCerts(CertificateMessage.java:1318) ~[?:?]
[2025-05-06T15:37:10,755][ERROR][o.o.t.n.s.SecureNetty4Transport] [datanode-2] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)

3. What steps have you already taken to try and solve the problem?
I tried to renew the certificate by clicking “renew-certificate” in the web interface.

4. How can the community help?
I don’t understand what to do to solve this issue.

Thanks, folks!

Hi @Etny,

First of all, I’d unify the graylog-server and graylog-datanode packages to the same version, preferably 6.2. This will help with debugging and make sure all the APIs are in sync. Could you please upgrade your server?
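
If you are on the official apt packages, the upgrade is the usual apt routine. A minimal sketch, assuming the Graylog repository from the installation docs is already configured on each VM:

# on the graylog-server VM
sudo apt-get update
sudo apt-get install graylog-server
sudo systemctl restart graylog-server

# on each data node VM
sudo apt-get update
sudo apt-get install graylog-datanode
sudo systemctl restart graylog-datanode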

How is your networking configured? Can all nodes see each other and resolve one another by name? Can you ping datanode-1 from datanode-2?
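
For example, from datanode-2 (a sketch; 9200 is the OpenSearch HTTP port that appears in your stack traces, adjust the names to your setup):

# does the name resolve, and to the address you expect?
getent hosts datanode-1
ping -c 3 datanode-1

# which names does the certificate presented by datanode-1 actually contain?
echo | openssl s_client -connect datanode-1:9200 2>/dev/null \
  | openssl x509 -noout -subject -ext subjectAltName

If the hostname you connect with is missing from the subject alternative names, the handshake fails exactly like in your log.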

Hi Tdvorak,
I restarted the Graylog services (data node, server, and mongod) and tested connectivity with ping, but in the end a reboot solved the issue.
After that, I upgraded my setup as you suggested, and tada! Everything is green now!


However, all my messages seem to be failing:

  2025-05-07T16:42:03.042+02:00 ERROR [IndexFieldTypePollerPeriodical] Couldn't update field types for index set <File Beat index/681b5392484710fa3d48a873>
org.graylog.shaded.opensearch2.org.opensearch.OpenSearchException: Unable to retrieve field types of index filebeat_index_23
        at org.graylog.storage.opensearch2.OpenSearchClient.exceptionFrom(OpenSearchClient.java:211) ~[?:?]
        at org.graylog.storage.opensearch2.OpenSearchClient.execute(OpenSearchClient.java:153) ~[?:?]
        at org.graylog.storage.opensearch2.OpenSearchClient.executeRequest(OpenSearchClient.java:172) ~[?:?]
        at org.graylog.storage.opensearch2.mapping.FieldMappingApi.fieldTypes(FieldMappingApi.java:51) ~[?:?]
        at org.graylog.storage.opensearch2.IndexFieldTypePollerAdapterOS2.pollIndex(IndexFieldTypePollerAdapterOS2.java:60) ~[?:?]
        at org.graylog2.indexer.fieldtypes.IndexFieldTypePoller.pollIndex(IndexFieldTypePoller.java:94) ~[graylog.jar:?]
        at org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical.lambda$poll$5(IndexFieldTypePollerPeriodical.java:205) ~[graylog.jar:?]
        at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241) [graylog.jar:?]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
        at java.base/java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.graylog.shaded.opensearch2.org.opensearch.client.ResponseException: method [GET], host [https://datanode-1:9200], URI [/filebeat_index_23/_mapping], status line [HTTP/1.1 404 Not Found]
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [filebeat_index_23]","index":"filebeat_index_23","resource.id":"filebeat_index_23","resource.type":"index_or_alias","index_uuid":"_na_"}],"type":"index_not_found_exception","reason":"no such index [filebeat_index_23]","index":"filebeat_index_23","resource.id":"filebeat_index_23","resource.type":"index_or_alias","index_uuid":"_na_"},"status":404}
        at org.graylog.shaded.opensearch2.org.opensearch.client.RestClient.convertResponse(RestClient.java:479) ~[?:?]
        at org.graylog.shaded.opensearch2.org.opensearch.client.RestClient.performRequest(RestClient.java:371) ~[?:?]
        at org.graylog.shaded.opensearch2.org.opensearch.client.RestClient.performRequest(RestClient.java:346) ~[?:?]
        at org.graylog.storage.opensearch2.OpenSearchClient.lambda$executeRequest$7(OpenSearchClient.java:173) ~[?:?]
        at org.graylog.storage.opensearch2.OpenSearchClient.execute(OpenSearchClient.java:151) ~[?:?]
        ... 12 more
2025-05-07T16:42:05.043+02:00 WARN  [IndexFieldTypePollerPeriodical] Active write index for index set "Graylog Events" (6819d107ca61280408d40272) doesn't exist yet
2025-05-07T16:42:08.043+02:00 WARN  [IndexFieldTypePollerPeriodical] Active write index for index set "Default index set" (6819d106ca61280408d401b3) doesn't exist yet
2025-05-07T16:42:09.042+02:00 WARN  [IndexFieldTypePollerPeriodical] Active write index for index set "File Beat index" (681b5392484710fa3d48a873) doesn't exist yet
2025-05-07T16:42:11.043+02:00 WARN  [IndexFieldTypePollerPeriodical] Active write index for index set "Graylog System Events" (6819d107ca61280408d40275) doesn't exist yet
2025-05-07T16:42:14.044+02:00 WARN  [IndexFieldTypePollerPeriodical] Active write index for index set "Default index set" (6819d106ca61280408d401b3) doesn't exist yet
user@graylog-server:~$ ping datanode-1
PING datanode-1 (172.28.128.151) 56(84) bytes of data.
64 bytes from datanode-1 (172.28.128.151): icmp_seq=1 ttl=64 time=0.153 ms
64 bytes from datanode-1 (172.28.128.151): icmp_seq=2 ttl=64 time=0.190 ms
64 bytes from datanode-1 (172.28.128.151): icmp_seq=3 ttl=64 time=0.177 ms
^C
--- datanode-1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2029ms
rtt min/avg/max/mdev = 0.153/0.173/0.190/0.015 ms
user@graylog-server:~$ ping datanode-2
PING datanode-2 (172.28.128.152) 56(84) bytes of data.
64 bytes from datanode-2 (172.28.128.152): icmp_seq=1 ttl=64 time=0.117 ms
64 bytes from datanode-2 (172.28.128.152): icmp_seq=2 ttl=64 time=0.156 ms
64 bytes from datanode-2 (172.28.128.152): icmp_seq=3 ttl=64 time=0.224 ms
^C
--- datanode-2 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2053ms
rtt min/avg/max/mdev = 0.117/0.165/0.224/0.044 ms
user@graylog-server:~$
user@datanode-1:~$ sudo systemctl status graylog-datanode.service
● graylog-datanode.service - Graylog data node
     Loaded: loaded (/lib/systemd/system/graylog-datanode.service; enabled; preset: enabled)
     Active: active (running) since Tue 2025-05-06 14:34:04 UTC; 24h ago
       Docs: http://docs.graylog.org/
   Main PID: 883 (java)
      Tasks: 223 (limit: 19134)
     Memory: 2.9G
        CPU: 29min 3.005s
     CGroup: /system.slice/graylog-datanode.service
             ├─ 883 /usr/share/graylog-datanode/jvm/bin/java -Dlog4j.configurationFile=file:///etc/graylog/datanode/log>
             └─1323 /usr/share/graylog-datanode/dist/opensearch-2.15.0-linux-x64/jdk/bin/java -Xshare:auto -Dopensearch>

mai 06 14:34:04 datanode-1 systemd[1]: Started graylog-datanode.service - Graylog data node.

I only have the default stream and default index set. I don’t understand why it’s failing.

Hey, good to hear that your networking issues are solved now!

Regarding the indexing failures: I’d suggest rotating your indices (System > Indices > open one index set > top-right “Maintenance” button).

It seems that, due to the initial error, some of the indices and aliases were not correctly initialized. Rotating them could fix these problems.
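
If the UI button does not help, the rotation can also be triggered through the REST API. A sketch, assuming the deflector cycle endpoint (double-check the exact route in your API browser at /api/api-browser); the index set ID is the hex string from your log lines, and Graylog requires an X-Requested-By header on POSTs:

curl -u admin:yourpassword \
  -H 'X-Requested-By: cli' \
  -X POST \
  "http://your_graylog_server:9000/api/system/deflector/681b5392484710fa3d48a873/cycle"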

Hi,
I tried rotating, but I still have the same errors.

user@datanode-1:~$ tail -n 20 /var/log/graylog-datanode/datanode.log
2025-05-09T08:14:42.295+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:14:42,295][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T08:19:42.295+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:19:42,295][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T08:24:42.296+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:24:42,295][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T08:29:42.297+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:29:42,296][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T08:33:56.986+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:33:56,986][WARN ][o.o.s.a.BackendRegistry  ] [datanode-1] Authentication finally failed for null from 172.28.128.150:36472
2025-05-09T08:33:56.997+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:33:56,997][WARN ][o.o.s.a.BackendRegistry  ] [datanode-1] Authentication finally failed for null from 172.28.128.150:36472
2025-05-09T08:34:41.808+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,808][INFO ][o.o.i.i.IndexStateManagementHistory] [datanode-1] Deleting old history indices viz []
2025-05-09T08:34:41.808+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,808][INFO ][o.o.i.i.IndexStateManagementHistory] [datanode-1] .opendistro-ism-managed-index-history-write not rolled over. Conditions were: {[max_docs: 2500000]=false, [max_age: 1d]=false}
2025-05-09T08:34:41.811+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,811][INFO ][o.o.t.t.CronTransportAction] [datanode-1] Start running hourly cron.
2025-05-09T08:34:41.811+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,811][INFO ][o.o.a.t.ADTaskManager    ] [datanode-1] Start to maintain running historical tasks
2025-05-09T08:34:41.813+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,811][INFO ][o.o.t.c.HourlyCron       ] [datanode-1] Hourly maintenance succeeds
2025-05-09T08:34:42.298+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:42,298][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T08:39:42.298+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:39:42,298][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T08:44:42.299+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:44:42,299][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T08:49:42.299+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:49:42,299][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T08:54:42.301+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:54:42,301][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T08:59:42.302+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:59:42,302][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T09:04:42.303+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T09:04:42,303][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T09:09:42.304+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T09:09:42,303][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
2025-05-09T09:14:42.304+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T09:14:42,304][INFO ][o.o.j.s.JobSweeper       ] [datanode-1] Running full sweep
user@datanode-2:~$ tail -n 20 /var/log/graylog-datanode/datanode.log
2025-05-09T08:09:42.030+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:09:42,030][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:14:42.031+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:14:42,031][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:19:42.032+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:19:42,031][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:24:42.032+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:24:42,032][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:29:42.033+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:29:42,032][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:33:56.993+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:33:56,993][WARN ][o.o.s.a.BackendRegistry  ] [datanode-2] Authentication finally failed for null from 172.28.128.150:52780
2025-05-09T08:34:41.614+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,614][INFO ][o.o.i.i.IndexStateManagementHistory] [datanode-2] Deleting old history indices viz []
2025-05-09T08:34:41.615+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,614][INFO ][o.o.i.i.IndexStateManagementHistory] [datanode-2] .opendistro-ism-managed-index-history-write not rolled over. Conditions were: {[max_docs: 2500000]=false, [max_age: 1d]=false}
2025-05-09T08:34:41.621+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,621][INFO ][o.o.t.t.CronTransportAction] [datanode-2] Start running hourly cron.
2025-05-09T08:34:41.622+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,621][INFO ][o.o.a.t.ADTaskManager    ] [datanode-2] Start to maintain running historical tasks
2025-05-09T08:34:41.622+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:41,622][INFO ][o.o.t.c.HourlyCron       ] [datanode-2] Hourly maintenance succeeds
2025-05-09T08:34:42.033+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:34:42,033][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:39:42.033+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:39:42,033][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:44:42.034+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:44:42,034][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:49:42.035+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:49:42,034][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:54:42.035+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:54:42,035][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T08:59:42.036+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T08:59:42,036][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T09:04:42.037+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T09:04:42,037][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T09:09:42.038+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T09:09:42,037][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
2025-05-09T09:14:42.038+02:00 INFO  [OpensearchProcessImpl] [2025-05-09T09:14:42,038][INFO ][o.o.j.s.JobSweeper       ] [datanode-2] Running full sweep
user@graylog-server:~$ tail -n 20 /var/log/graylog-server/server.log
        at org.graylog.storage.opensearch2.mapping.FieldMappingApi.fieldTypes(FieldMappingApi.java:51) ~[?:?]
        at org.graylog.storage.opensearch2.IndexFieldTypePollerAdapterOS2.pollIndex(IndexFieldTypePollerAdapterOS2.java:60) ~[?:?]
        at org.graylog2.indexer.fieldtypes.IndexFieldTypePoller.pollIndex(IndexFieldTypePoller.java:94) ~[graylog.jar:?]
        at org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical.lambda$poll$5(IndexFieldTypePollerPeriodical.java:205) ~[graylog.jar:?]
        at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:241) [graylog.jar:?]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
        at java.base/java.util.concurrent.FutureTask.run(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
        at java.base/java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.graylog.shaded.opensearch2.org.opensearch.client.ResponseException: method [GET], host [https://datanode-1:9200], URI [/filebeat_index_23/_mapping], status line [HTTP/1.1 404 Not Found]
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [filebeat_index_23]","index":"filebeat_index_23","resource.id":"filebeat_index_23","resource.type":"index_or_alias","index_uuid":"_na_"}],"type":"index_not_found_exception","reason":"no such index [filebeat_index_23]","index":"filebeat_index_23","resource.id":"filebeat_index_23","resource.type":"index_or_alias","index_uuid":"_na_"},"status":404}
        at org.graylog.shaded.opensearch2.org.opensearch.client.RestClient.convertResponse(RestClient.java:479) ~[?:?]
        at org.graylog.shaded.opensearch2.org.opensearch.client.RestClient.performRequest(RestClient.java:371) ~[?:?]
        at org.graylog.shaded.opensearch2.org.opensearch.client.RestClient.performRequest(RestClient.java:346) ~[?:?]
        at org.graylog.storage.opensearch2.OpenSearchClient.lambda$executeRequest$7(OpenSearchClient.java:173) ~[?:?]
        at org.graylog.storage.opensearch2.OpenSearchClient.execute(OpenSearchClient.java:151) ~[?:?]
        ... 12 more
2025-05-09T09:17:31.043+02:00 WARN  [IndexFieldTypePollerPeriodical] Active write index for index set "Default index set" (6819d106ca61280408d401b3) doesn't exist yet

Hi,
Does anybody have an idea to help me?
I don’t know where to look to solve this issue:

Timestamp	Index	Letter ID	Error message
5 minutes ago	graylog_deflector	01JV1QTWVE000019T7MNVVMJ49	[graylog_deflector] OpenSearchException[OpenSearch exception [type=index_not_found_exception, reason=no such index [graylog_deflector]]]

Here are my indices:

Thank you.

Could you please open this URL in your browser and post the results here? http://your_graylog_server:9000/api/datanodes/any/opensearch/_cat/indices

Then we can see whether some indices are missing in your OpenSearch. I assume that during the initial problems you ended up with some inconsistencies between the MongoDB state and the OpenSearch indices.
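
If the terminal is easier, the same proxy endpoint works with curl (use your Graylog web credentials; ?v should add the header row):

curl -u admin:yourpassword \
  "http://your_graylog_server:9000/api/datanodes/any/opensearch/_cat/indices?v"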

Here it is:

green  open investigation_event_index_0    fiQj5v4WQC2seVQs7sR73g 1 0     0 0    208b    208b
green  open .opendistro_security           e0kng6uHRfSMuHRReuWchQ 1 0    10 0  68.5kb  68.5kb
yellow open .ds-gl-datanode-metrics-000001 CzFLABSvR4GDEQVjJnBQfQ 1 1  2814 0 719.5kb 719.5kb
green  open graylog_6                      B_BtivgpTrOKmO8vxtk1yg 1 0     0 0    208b    208b
yellow open .ds-gl-datanode-metrics-000002 T-rMzn5IRk63v2Y6L1kfhQ 1 1 14478 0     3mb     3mb
green  open graylog_1                      aGZwYlZlQAKc094NT2CCKA 1 0     0 0    208b    208b
green  open graylog_0                      Py0IH56HTh2vPBB1mVwhuw 1 0  2063 0   954kb   954kb
green  open gl-system-events_0             4pL3x0g8Ts2s5eYPPhOkhw 1 0     5 0  54.2kb  54.2kb
green  open graylog_5                      8YCmyU0zQ06TZkHIuZj1vQ 1 0  2515 0   1.4mb   1.4mb
green  open graylog_4                      OhBbti6BTIu8XdzRN8uX1A 1 0     0 0    208b    208b
green  open graylog_3                      SdOK0ZJyQZC6CgAB1pkpGg 1 0     0 0    208b    208b
green  open graylog_2                      _xuNaxvuT7KSSZYBr0DAzQ 1 0     0 0    208b    208b
green  open gl-events_0                    TOET9On6TsCJ4W_k3Q5q7w 1 0     0 0    208b    208b
green  open .plugins-ml-config             v_YSiBpKSb-ZbjkyIDCjaw 1 0     1 0   3.9kb   3.9kb
yellow open .opendistro-job-scheduler-lock TXuuBH6PRligRuFbihGezA 1 1     2 0  16.9kb  16.9kb

Ok, thank you!

I’d suggest deleting the File Beat index set if you don’t need it. If you do, you can always re-create it later. This should stop at least a portion of the errors you see in your logs. Then we can continue with whatever’s left.
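
If you prefer scripting it, index sets can also be deleted via the REST API. A sketch, assuming the index-set resource route (verify it in /api/api-browser first); the ID is the one from your earlier log lines:

curl -u admin:yourpassword \
  -H 'X-Requested-By: cli' \
  -X DELETE \
  "http://your_graylog_server:9000/api/system/indices/index_sets/681b5392484710fa3d48a873"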

Thank you for your help! I deleted the index as you suggested.

However, the more I work on this setup, the more I realize there’s a lot I don’t fully understand.
Here are my nodes:


I’m not sure it’s normal for both of my nodes to be cluster_manager and remote_cluster_client at the same time.
What I don’t understand is how logs are stored between datanode-1 and datanode-2.
I noticed that the indices are different on each node:


Could you explain how this works, or point me to some documentation that could help?

I guess that during the initial setup, when you had only one node available, some indices were created on it and not properly replicated to the other node afterwards. The OpenSearch cluster is probably in a yellow or red state, right?

You mention that both nodes think they are cluster_manager, so it seems they actually formed two independent clusters?
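
A quick way to confirm is to ask OpenSearch who its cluster members are. A sketch, using the same proxy endpoint as before (note that “any” routes to one node, so run it a few times if in doubt):

# list the nodes of the cluster that answers
curl -u admin:yourpassword \
  "http://your_graylog_server:9000/api/datanodes/any/opensearch/_cat/nodes?v"

# overall health and node count
curl -u admin:yourpassword \
  "http://your_graylog_server:9000/api/datanodes/any/opensearch/_cluster/health?pretty"

If _cat/nodes lists only one node, or number_of_nodes is 1 in the health output, the data nodes did not form a single cluster.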

My suggestion would be to delete the contents of the data and config directories on both data nodes, delete the contents of MongoDB, and restart everything (see the sketch below). This will take you back to the preflight and give you a fresh start, without all the problems that would be quite complex to fix now.

Additionally, I’d suggest adding a third data node. Two nodes are not enough for a stable cluster. They are prone to split-brain problems, which could be what we see in your situation.
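
Roughly, the reset would look like this. A sketch; the paths match the defaults from your datanode.conf, and this wipes all Graylog state, so only do it while the setup is still disposable:

# stop everything first
sudo systemctl stop graylog-server mongod    # on the server VM
sudo systemctl stop graylog-datanode         # on each data node VM

# on each data node: wipe the OpenSearch state
sudo rm -rf /var/lib/graylog-datanode/opensearch/data/*
sudo rm -rf /var/lib/graylog-datanode/opensearch/config/*

# on the server VM: wipe MongoDB
sudo rm -rf /var/lib/mongodb/*

# start everything again; Graylog comes back up in preflight mode
sudo systemctl start mongod graylog-server   # on the server VM
sudo systemctl start graylog-datanode        # on each data node VM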

You’re right, my cluster was yellow.

So I did what you suggested:
I deleted the contents of the data and config directories.

I did the same for datanode-2 and datanode-3.

root@datanode-1:/var/lib/graylog-datanode/opensearch# ls -la
total 16
drwxr-xr-x 4 root             root             4096 14 mai   13:31 .
drwxr-xr-x 4 graylog-datanode graylog-datanode 4096  6 mai   09:05 ..
drwxr-xr-x 2 graylog-datanode graylog-datanode 4096 14 mai   13:31 config
drwxr-xr-x 2 graylog-datanode graylog-datanode 4096 14 mai   13:37 data
root@graylog-server:/var/lib/mongodb# rm -r *
root@graylog-server:/var/lib/mongodb# ls -la
total 20
drwxr-xr-x  2 mongodb mongodb 16384 14 mai   13:42 .
drwxr-xr-x 26 root    root     4096  6 mai   08:41

I also added a third data node, as you recommended.
Then I rebooted everything.
Now, graylog-server, mongodb, and all graylog-datanode services are up and running.

And then…

user@graylog-server:~$ sudo tail -n 7 /var/log/graylog-server/server.log

========================================================================================================

2025-05-14T15:49:04.101+02:00 INFO  [CustomCAX509TrustManager] CA changed, refreshing trust manager
2025-05-14T15:49:13.657+02:00 INFO  [CaKeystore] Signing certificate for  node cfbabaf3-e873-4977-901c-14f13b302b11, subject: CN=datanode-3
2025-05-14T15:49:13.689+02:00 INFO  [CaKeystore] Signing certificate for  node 4f15e49d-84d3-4dc7-9c13-49153bac64b9, subject: CN=datanode-1
2025-05-14T15:49:13.707+02:00 INFO  [CaKeystore] Signing certificate for  node e899cf63-a07f-45e4-99af-18f8716188e9, subject: CN=datanode-2
user@graylog-server:~$ sudo tail -n 7 /var/log/mongodb/mongod.log
{"t":{"$date":"2025-05-14T13:49:04.757+00:00"},"s":"I",  "c":"NETWORK",  "id":6788700, "ctx":"conn38","msg":"Received first command on ingress connection since session start or auth handshake","attr":{"elapsedMillis":0}}
{"t":{"$date":"2025-05-14T13:49:05.016+00:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"172.28.128.151:60852","uuid":{"uuid":{"$uuid":"9e3a69c2-0639-4d79-8c47-dc35eedb477e"}},"connectionId":39,"connectionCount":20}}
{"t":{"$date":"2025-05-14T13:49:05.016+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn39","msg":"client metadata","attr":{"remote":"172.28.128.151:60852","client":"conn39","negotiatedCompressors":[],"doc":{"driver":{"name":"mongo-java-driver|legacy","version":"5.4.0"},"os":{"type":"Linux","name":"Linux","architecture":"amd64","version":"6.1.0-34-amd64"},"platform":"Java/Eclipse Adoptium/17.0.14+7"}}}
{"t":{"$date":"2025-05-14T13:49:05.017+00:00"},"s":"I",  "c":"NETWORK",  "id":6788700, "ctx":"conn39","msg":"Received first command on ingress connection since session start or auth handshake","attr":{"elapsedMillis":0}}
{"t":{"$date":"2025-05-14T13:49:36.408+00:00"},"s":"I",  "c":"WTCHKPT",  "id":22430,   "ctx":"Checkpointer","msg":"WiredTiger message","attr":{"message":{"ts_sec":1747230576,"ts_usec":408638,"thread":"741:0x7f46cd6666c0","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_CHECKPOINT_PROGRESS","category_id":6,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"saving checkpoint snapshot min: 1036, snapshot max: 1036 snapshot count: 0, oldest timestamp: (0, 0) , meta checkpoint timestamp: (0, 0) base write gen: 1"}}}
{"t":{"$date":"2025-05-14T13:50:36.444+00:00"},"s":"I",  "c":"WTCHKPT",  "id":22430,   "ctx":"Checkpointer","msg":"WiredTiger message","attr":{"message":{"ts_sec":1747230636,"ts_usec":444697,"thread":"741:0x7f46cd6666c0","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_CHECKPOINT_PROGRESS","category_id":6,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"saving checkpoint snapshot min: 1221, snapshot max: 1221 snapshot count: 0, oldest timestamp: (0, 0) , meta checkpoint timestamp: (0, 0) base write gen: 1"}}}
{"t":{"$date":"2025-05-14T13:51:36.465+00:00"},"s":"I",  "c":"WTCHKPT",  "id":22430,   "ctx":"Checkpointer","msg":"WiredTiger message","attr":{"message":{"ts_sec":1747230696,"ts_usec":465771,"thread":"741:0x7f46cd6666c0","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_CHECKPOINT_PROGRESS","category_id":6,"verbose_level":"DEBUG_1","verbose_level_id":1,"msg":"saving checkpoint snapshot min: 1402, snapshot max: 1402 snapshot count: 0, oldest timestamp: (0, 0) , meta checkpoint timestamp: (0, 0) base write gen: 1"}}}

Even though everything seems to be working fine, I’m still getting these error lines:


/var/log/graylog-datanode/datanode.log:2025-05-14T15:54:46.130+02:00 INFO  [OpensearchProcessImpl] [2025-05-14T15:54:46,114][ERROR][o.o.t.n.s.SecureNetty4Transport] [datanode-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown
/var/log/graylog-datanode/datanode.log:2025-05-14T15:54:46.131+02:00 INFO  [OpensearchProcessImpl] [2025-05-14T15:54:46,114][ERROR][o.o.t.n.s.SecureNetty4Transport] [datanode-1] Exception during establishing a SSL connection: javax.net.ssl.SSLHandshakeException: Insufficient buffer remaining for AEAD cipher fragment (2). Needs to be more than tag size (16)

I understand this is probably a certificate issue, but I literally deleted everything.
So… what am I supposed to do now, besides smashing my keyboard?

Ok, let’s double check everything.

  • Are you now running the same Graylog server and data node versions?
  • Did you delete the contents of MongoDB before starting all the services?
  • Is your CA self-signed, created during the preflight, or are you using a custom CA? (See the certificate check below.)
  • Are you using a custom JVM, or the ones bundled with Graylog server and data node? Any JAVA_HOME or OPENSEARCH_JAVA_HOME environment variables set?
  • Can you upload the full data node logs?
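
For the CA question, one check that might help: compare the issuer and validity dates of the certificate each data node presents. After a clean preflight they should all be signed by the same, freshly created CA. A sketch:

for node in datanode-1 datanode-2 datanode-3; do
  echo "== $node =="
  echo | openssl s_client -connect "$node:9200" 2>/dev/null \
    | openssl x509 -noout -issuer -subject -dates
done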

Hello!

  • Are you now running the same Graylog server and data node versions?
    Yes, absolutely… Not!
    Due to all my tests and snapshot restorations, the nodes and server were not on the same version.
    I’ve now upgraded everything to version 6.2.2-1.

Did you delete the contents of MongoDB before starting all the services?
Yes, here’s what I did:

  • Shut down the data node, Graylog server, and MongoDB

  • On the data node: deleted data and config files

  • On the server: deleted the contents of /var/lib/mongodb

  • Rebooted all VMs

  • Launched the preflight web interface

  • Created the CA

  • Configured the renewal policy

  • Provisioned the certificates

  • Prayed

  • No changes

Are you using a custom JVM, or the ones bundled with Graylog server and data node? Any JAVA_HOME or OPENSEARCH_JAVA_HOME environment variables set?
No custom JVM.
I just installed Debian 12 and followed the Graylog installation instructions from the official documentation.

A few moments later… I’ve got a clue.

  1. When I run only one data node, it seems to work fine:

  2. I add one node, click “restart the configuration”, and it still looks good:

  3. I add the third one…

  4. I delete the first two…

  5. I turn the first two back on…

  6. Click “Restart Configuration”

  7. And… still the same issue as at the beginning of the week!


Maintenance operations like recalculate or rotate are painless.

If I clear everything and restart the preflight…

The only setup that always works without any problem is a single node!

Thanks for all the information!

At this point, my assumption is that, for some reason, your nodes form two different clusters. You create a 2-node cluster first, and when the third node starts, it’s unable to join. You then end up in a strange situation with two partially working clusters. I believe we can confirm that in the logs, but I would need the full data node logs of all three nodes.

I think you can also try to start all three nodes at once, and only then go to the preflight and start provisioning.
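
Something like this, so all three nodes can discover each other before any provisioning happens (a sketch):

# on each of the three data node VMs, within a few seconds of each other
sudo systemctl restart graylog-datanode

# then watch the logs while the nodes form the cluster
sudo tail -f /var/log/graylog-datanode/datanode.log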

The cluster_manager role you see in the cluster overview is just a list of roles, meaning every node can become a manager, not that every node actually is the manager.

This call can also give us some hints: http://your_graylog_server:9000/api/datanodes/any/opensearch/_cluster/state/
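
For example, filtered down to the parts that matter here (a sketch; _cluster/state accepts a metric filter):

curl -u admin:yourpassword \
  "http://your_graylog_server:9000/api/datanodes/any/opensearch/_cluster/state/version,master_node,nodes?pretty"

If all three nodes show up under "nodes" and agree on one master_node, the cluster formed correctly.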