"Authentication Finally Failed"

1. Describe your incident:
After running for 10 minutes, my leader Graylog node starts throwing authentication errors against my two Graylog Data Node backend servers. There is a string of Java errors beginning with:
graylog-server Caused by: org.graylog.shaded.opensearch2.org.opensearch.client.ResponseException: method [GET], host [https://datanode-2:9200], URI [/p_f_237/_stats/store], status line [HTTP/1.1 401 Unauthorized]
and:
graylog-server 2025-04-08T10:26:49.436+01:00 WARN [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://Datanode-1:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
followed by a graylog-server "Authentication finally failed".

This happens every minute. The system is mostly working, though widgets / searches will occasionally return a "401 Unauthorized / Authentication finally failed" until refreshed.

The problem doesn’t seem to be present on the secondary Graylog web node.

2. Describe your environment:

  • OS Information: RHEL 9.5
  • Package Version: graylog-server-6.1.10 & graylog-datanode-6.1.10
  • Service logs, configurations, and environment variables:
    I can zip up some logs and post them if need be, but extended snippets are in the first reply:
    server.conf
    is_leader = true
    node_id_file = /etc/graylog/server/node-id
    password_secret = REMOVED
    root_password_sha2 = REMOVED
    bin_dir = /usr/share/graylog-server/bin
    data_dir = /opt/graylog-server
    plugin_dir = /usr/share/graylog-server/plugin
    http_bind_address = 10.181.144.15:9000
    http_publish_uri = https://graylog.domain.com:9000/
    http_enable_tls = true
    http_tls_cert_file = /etc/graylog/graylog.pem
    http_tls_key_file = /etc/graylog/graylog.key
    stream_aware_field_types=false
    disabled_retention_strategies = none,close
    allow_leading_wildcard_searches = false
    allow_highlighting = false
    field_value_suggestion_mode = on
    output_batch_size = 5000
    output_flush_interval = 1
    output_fault_count_threshold = 5
    output_fault_penalty_seconds = 30
    processor_wait_strategy = blocking
    ring_size = 65536
    inputbuffer_ring_size = 65536
    inputbuffer_wait_strategy = blocking
    message_journal_enabled = true
    message_journal_dir = /opt/graylog-server/journal
    message_journal_max_age = 24h
    message_journal_max_size = 60gb
    lb_recognition_period_seconds = 3
    mongodb_uri = mongodb://graylog_user:password@graylog1/graylog
    mongodb_max_connections = 1000
    transport_email_enabled = true
    transport_email_hostname = mailrelay.mail.com
    transport_email_port = 25
    transport_email_use_auth = false

datanode.conf
node_id_file = /etc/graylog/datanode/node-id
config_location = /etc/graylog/datanode
password_secret = REMOVED
root_password_sha2 = REMOVED
mongodb_uri = mongodb://graylog_user:password@graylog1/graylog
opensearch_location = /usr/share/graylog-datanode/dist
opensearch_config_location = /opt/graylog-datanode/opensearch/config
opensearch_data_location = /opt/graylog-datanode/opensearch/data
opensearch_logs_location = /var/log/graylog-datanode/opensearch
opensearch_heap = 24g

3. What steps have you already taken to try and solve the problem?
Checked logs, checked services, checked configs; all of these seem to be using the correct strings, SHA hashes, and passwords. Nothing seems to differ between the erroring and non-erroring Graylog node except for “is_leader” on the leader node.

4. How can the community help?

I’m not really sure why these errors occur, what they mean, or how to fix them, which is why I’m here!

Hi @mc114,
The communication between the Graylog server and the Data Node uses JWT auth. The JWT tokens are derived from password_secret and have some caches and timeouts, but none of them match your 10 minutes; they expire sooner. Otherwise the “Authentication finally failed” would signal problems with JWT. Here are my suggestions to check:

  • Double-check that your password_secret is set to the very same value on every node.
  • Check the Data Node logs for any OpenSearch stack traces suggesting possible problems.
  • Is there any proxy involved anywhere in the communication?
  • Could you post the full stack trace of the error, so I know which part of the code base is triggering the request?
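One way to verify the first point without pasting the secret anywhere is to hash it on each node and compare the digests. A minimal sketch (the config path and the example value below are illustrative; on real nodes you would point it at /etc/graylog/server/server.conf or /etc/graylog/datanode/datanode.conf):

```shell
# Print a SHA-256 digest of the password_secret value in a Graylog config
# file, so digests from different nodes can be compared without ever
# exposing the secret itself.
secret_digest() {
  grep -E '^password_secret' "$1" | sed 's/^[^=]*= *//' | tr -d '\n' | sha256sum | awk '{print $1}'
}

# Example usage against a throwaway file:
tmp=$(mktemp)
printf 'password_secret = example-secret-value\n' > "$tmp"
secret_digest "$tmp"
rm -f "$tmp"
```

If the 64-character digests differ between any two nodes, the secrets differ, even if they look the same at a glance (trailing whitespace is a common culprit).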

Another thing, maybe not relevant to the problem: if you are running a two-node Data Node cluster, you should add at least one more node. Two-node clusters are prone to failures and split-brain issues and are not a recommended setup.

Thanks,
Tomas

Hi @Tdvorak, thanks for getting back to me. Sorry for taking so long to reply - I’ve been on holiday!

I’ll see if I can get a 3rd node into the cluster; we have a dev front/backend setup on ‘free’ Red Hat dev licences, which I could potentially break up so one of the boxes moves into the live cluster.

I’ve been through the Data Node logs and there are only a few flavours of “authentication failed” in both the opensearch and datanode logs:

2025-04-22T08:28:56.431Z INFO [OpensearchProcessImpl] [2025-04-22T08:28:56,431][WARN ][o.o.s.a.BackendRegistry ] [RNGDC1PGLS01.rng.reddenorthgate.com] Authentication finally failed for null from 10.181.144.15:43942
2025-04-22T08:28:57.074Z INFO [OpensearchProcessImpl] [2025-04-22T08:28:57,074][WARN ][o.o.s.a.BackendRegistry ] [RNGDC1PGLS01.rng.reddenorthgate.com] Authentication finally failed for null from 10.181.144.15:43942
2025-04-22T08:28:57.389Z INFO [OpensearchProcessImpl] [2025-04-22T08:28:57,389][WARN ][o.o.s.a.BackendRegistry ] [RNGDC1PGLS01.rng.reddenorthgate.com] Authentication finally failed for null from 10.181.144.15:43942

and:

{"type": "server", "timestamp": "2025-04-22T08:25:52,590Z", "level": "WARN", "component": "o.o.s.a.BackendRegistry", "cluster.name": "datanode-cluster", "node.name": "RNGDC1PGLS01.rng.reddenorthgate.com", "message": "Authentication finally failed for null from 10.181.144.15:60236", "cluster.uuid": "zET-JFeuSkOn3YMjqGYrvw", "node.id": "KEYKdfuVTpytmfAAmlA1ZQ" }
{"type": "server", "timestamp": "2025-04-22T08:25:52,770Z", "level": "WARN", "component": "o.o.s.a.BackendRegistry", "cluster.name": "datanode-cluster", "node.name": "RNGDC1PGLS01.rng.reddenorthgate.com", "message": "Authentication finally failed for null from 10.181.144.15:60236", "cluster.uuid": "zET-JFeuSkOn3YMjqGYrvw", "node.id": "KEYKdfuVTpytmfAAmlA1ZQ" }

The server.log is much more descriptive and has plenty of errors and stack traces in it. I’ve snipped a bit out here, and will work on uploading the full log once I can move it off the LAN.


2025-04-22T09:19:04.968+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS02.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #8).
2025-04-22T09:19:05.184+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:05.306+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS02.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #9).
2025-04-22T09:19:05.903+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS01.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #10).
2025-04-22T09:19:06.080+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS02.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #12).
2025-04-22T09:19:06.184+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:07.042+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS02.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #11).
2025-04-22T09:19:07.180+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:08.162+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:08.162+01:00 WARN  [IndexRotationThread] Elasticsearch cluster isn't healthy. Skipping index rotation.
2025-04-22T09:19:08.180+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:08.184+01:00 ERROR [VersionProbe] Unable to retrieve version from indexer node RNGDC1PGLS02.rng.reddenorthgate.com:9200: unknown error - an exception occurred while deserializing error response: {}
com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Authentication': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (okio.Buffer$inputStream$1); line: 1, column: 16]
        at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2572) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.JsonParser._constructReadException(JsonParser.java:2598) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.JsonParser._constructReadException(JsonParser.java:2606) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:765) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3659) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2747) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:867) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:753) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader._initForReading(ObjectReader.java:357) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:2115) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1501) ~[graylog.jar:?]
        at retrofit2.converter.jackson.JacksonResponseBodyConverter.convert(JacksonResponseBodyConverter.java:33) ~[graylog.jar:?]
        at retrofit2.converter.jackson.JacksonResponseBodyConverter.convert(JacksonResponseBodyConverter.java:23) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probeSingleHost$3(VersionProbe.java:149) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.rootResponse(VersionProbe.java:208) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probeSingleHost(VersionProbe.java:159) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probeAllHosts$2(VersionProbe.java:125) ~[graylog.jar:?]
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source) ~[?:?]
        at java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(Unknown Source) ~[?:?]
        at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source) ~[?:?]
        at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.evaluate(Unknown Source) ~[?:?]
        at java.base/java.util.stream.ReferencePipeline.findFirst(Unknown Source) ~[?:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probeAllHosts(VersionProbe.java:127) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probe$1(VersionProbe.java:107) ~[graylog.jar:?]
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) [graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:160) [graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probe(VersionProbe.java:107) [graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probe(VersionProbe.java:84) [graylog.jar:?]
        at org.graylog2.periodical.ESVersionCheckPeriodical.doRun(ESVersionCheckPeriodical.java:104) [graylog.jar:?]
        at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:99) [graylog.jar:?]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
        at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
        at java.base/java.lang.Thread.run(Unknown Source) [?:?]
2025-04-22T09:19:08.185+01:00 ERROR [VersionProbe] Unable to retrieve version from indexer node RNGDC1PGLS01.rng.reddenorthgate.com:9200: unknown error - an exception occurred while deserializing error response: {}
com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Authentication': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (okio.Buffer$inputStream$1); line: 1, column: 16]
        at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2572) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.JsonParser._constructReadException(JsonParser.java:2598) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.JsonParser._constructReadException(JsonParser.java:2606) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:765) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3659) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2747) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:867) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:753) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader._initForReading(ObjectReader.java:357) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:2115) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1501) ~[graylog.jar:?]
        at retrofit2.converter.jackson.JacksonResponseBodyConverter.convert(JacksonResponseBodyConverter.java:33) ~[graylog.jar:?]
        at retrofit2.converter.jackson.JacksonResponseBodyConverter.convert(JacksonResponseBodyConverter.java:23) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probeSingleHost$3(VersionProbe.java:149) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.rootResponse(VersionProbe.java:208) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probeSingleHost(VersionProbe.java:159) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probeAllHosts$2(VersionProbe.java:125) ~[graylog.jar:?]
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source) ~[?:?]
        at java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(Unknown Source) ~[?:?]
        at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source) ~[?:?]
        at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.evaluate(Unknown Source) ~[?:?]
        at java.base/java.util.stream.ReferencePipeline.findFirst(Unknown Source) ~[?:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probeAllHosts(VersionProbe.java:127) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probe$1(VersionProbe.java:107) ~[graylog.jar:?]
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) [graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:160) [graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probe(VersionProbe.java:107) [graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probe(VersionProbe.java:84) [graylog.jar:?]
        at org.graylog2.periodical.ESVersionCheckPeriodical.doRun(ESVersionCheckPeriodical.java:104) [graylog.jar:?]
        at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:99) [graylog.jar:?]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
        at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
        at java.base/java.lang.Thread.run(Unknown Source) [?:?]
2025-04-22T09:19:08.186+01:00 INFO  [VersionProbe] Indexer is not available. Retry #3
2025-04-22T09:19:09.183+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS01.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #12).
2025-04-22T09:19:09.355+01:00 INFO  [Messages] Bulk indexing finally successful (attempt #13).
2025-04-22T09:19:11.085+01:00 INFO  [Messages] Bulk indexing finally successful (attempt #13).
2025-04-22T09:19:14.203+01:00 INFO  [Messages] Bulk indexing finally successful (attempt #13).

It’s worth noting the cluster claims to be healthy… Even though the server.log above says otherwise!

Thanks for the logs. From the server log I see that at one moment auth is failing, but the next it’s happily bulk-indexing data.

If your password_secret is configured correctly everywhere (all places using the same value), then there is one more thing that can break the auth. Could you please verify that the time is in sync on all machines? JWT relies on exact time, and while there are some buffers for token validity, generally any difference larger than a few seconds could cause some requests to fail. I’d recommend using NTP everywhere if you aren’t already.
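To make that check concrete, here is a small sketch for comparing epoch timestamps between nodes. The hostname, the ssh command, and the 5-second budget are illustrative, not Graylog’s actual tolerance:

```shell
# skew_ok LOCAL_EPOCH REMOTE_EPOCH MAX_SKEW
# Succeeds if the absolute difference between the two epoch timestamps
# is within MAX_SKEW seconds.
skew_ok() {
  diff=$(( $1 - $2 ))
  [ "${diff#-}" -le "$3" ]   # ${diff#-} strips the sign -> absolute value
}

# Example: a remote clock 45 seconds behind, checked against a 5s budget.
# On a real node you would collect the remote value over ssh, e.g.:
#   remote=$(ssh datanode-1 date +%s)
local_now=$(date +%s)
remote=$(( local_now - 45 ))
if skew_ok "$local_now" "$remote" 5; then
  echo "clocks within tolerance"
else
  echo "clock skew exceeds tolerance - check NTP/chrony"
fi
```

Running this against every Data Node from the leader quickly shows whether any clock is off by more than a couple of seconds.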

Thanks
Tomas

@Tdvorak Thank you! Time is all over the place. It’s worst on the is_leader=true graylog node.
I’ll get that fixed up and report back.

So the next person to stumble upon this thread has some more detail: the culprit was time. The primary Graylog node was correct, but the two Data Nodes and the secondary server were on a different time, 45 seconds in the past. I assume the JWT tokens got signed with a time in the future and it all got confused until the time was corrected?

The issue seems to stem from our sprawling VMware environment not having NTP set up properly, so I went through the VMs and disabled time sync, then correctly configured chrony (which was a mission in itself) to read from local domain DCs instead of firewalled internet servers. Everything now has the same time to within a second and all the auth errors have gone away.
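For anyone following along, the fix amounts to a chrony.conf along these lines; this is a sketch, and the DC hostnames are placeholders for your own domain controllers. On VMware guests, the hypervisor's periodic time sync in VMware Tools should be disabled first so the two mechanisms don't fight:

```
# /etc/chrony.conf - minimal sketch, hostnames are placeholders
server dc1.example.internal iburst
server dc2.example.internal iburst
driftfile /var/lib/chrony/drift
# Step the clock on the first few updates if it is far off (e.g. 45s),
# instead of slewing slowly
makestep 1.0 3
rtcsync
```

After a `systemctl restart chronyd`, `chronyc tracking` should report a system-time offset in the millisecond range.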

Thank you Tomas.

Happy to help! Thank you for the detailed explanation! We’re now discussing internally whether some automated checks could prevent such problems in the future.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.