"Authentication Finally Failed"

1. Describe your incident:
After running for 10 minutes, my leader Graylog node starts throwing authentication errors against my two Graylog Data Node backend servers. There is a string of Java errors beginning with:
graylog-server Caused by: org.graylog.shaded.opensearch2.org.opensearch.client.ResponseException: method [GET], host [https://datanode-2:9200], URI [/p_f_237/_stats/store], status line [HTTP/1.1 401 Unauthorized]
and:
graylog-server 2025-04-08T10:26:49.436+01:00 WARN [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://Datanode-1:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
followed by a graylog-server "Authentication finally failed".

This happens every minute. The system is mostly working, though widgets / searches will occasionally return a "401 Unauthorized / Authentication finally failed" until refreshed.

The problem doesn’t seem to be present on the secondary Graylog web node.

2. Describe your environment:

  • OS Information: RHEL 9.5
  • Package Version: graylog-server-6.1.10 & graylog-datanode-6.1.10
  • Service logs, configurations, and environment variables:
    I can zip up some logs and post them if need be, but extended snippets are in the first reply:
    server.conf
    is_leader = true
    node_id_file = /etc/graylog/server/node-id
    password_secret = REMOVED
    root_password_sha2 = REMOVED
    bin_dir = /usr/share/graylog-server/bin
    data_dir = /opt/graylog-server
    plugin_dir = /usr/share/graylog-server/plugin
    http_bind_address = 10.181.144.15:9000
    http_publish_uri = https://graylog.domain.com:9000/
    http_enable_tls = true
    http_tls_cert_file = /etc/graylog/graylog.pem
    http_tls_key_file = /etc/graylog/graylog.key
    stream_aware_field_types=false
    disabled_retention_strategies = none,close
    allow_leading_wildcard_searches = false
    allow_highlighting = false
    field_value_suggestion_mode = on
    output_batch_size = 5000
    output_flush_interval = 1
    output_fault_count_threshold = 5
    output_fault_penalty_seconds = 30
    processor_wait_strategy = blocking
    ring_size = 65536
    inputbuffer_ring_size = 65536
    inputbuffer_wait_strategy = blocking
    message_journal_enabled = true
    message_journal_dir = /opt/graylog-server/journal
    message_journal_max_age = 24h
    message_journal_max_size = 60gb
    lb_recognition_period_seconds = 3
    mongodb_uri = mongodb://graylog_user:password@graylog1/graylog
    mongodb_max_connections = 1000
    transport_email_enabled = true
    transport_email_hostname = mailrelay.mail.com
    transport_email_port = 25
    transport_email_use_auth = false

datanode.conf
node_id_file = /etc/graylog/datanode/node-id
config_location = /etc/graylog/datanode
password_secret = REMOVED
root_password_sha2 = REMOVED
mongodb_uri = mongodb://graylog_user:password@graylog1/graylog
opensearch_location = /usr/share/graylog-datanode/dist
opensearch_config_location = /opt/graylog-datanode/opensearch/config
opensearch_data_location = /opt/graylog-datanode/opensearch/data
opensearch_logs_location = /var/log/graylog-datanode/opensearch
opensearch_heap = 24g

3. What steps have you already taken to try and solve the problem?
Checked logs, checked services, checked configs; all of these seem to be using the correct strings, SHA hashes, and passwords. Nothing seems to differ between the erroring and non-erroring Graylog node except for “is_leader” on the leader node.

4. How can the community help?

I’m not really sure why these errors occur, what they mean, or how to fix them, which is why I’m here!

Hi @mc114,
The communication between the Graylog server and the Data Node uses JWT auth. The JWT tokens are derived from password_secret and have some caches and timeouts, but none of them match your 10 minutes; they expire sooner. Otherwise the “Authentication finally failed” would signal problems with JWT. Here are my suggestions to check:

  • Double-check that your password_secret is set to the very same value on every node.
  • Check the Data Node logs for any OpenSearch stack traces suggesting possible problems.
  • Is there any proxy involved anywhere in the communication?
  • Could you post the full stack trace of the error, so I know which part of the code base is triggering the request?
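One way to verify the first point without pasting the secret anywhere is to hash it on each node and compare the digests. A minimal sketch (the config path and the example value below are illustrative; on real nodes you would point it at /etc/graylog/server/server.conf or /etc/graylog/datanode/datanode.conf):

```shell
# Print a SHA-256 digest of the password_secret value in a Graylog config
# file, so digests from different nodes can be compared without ever
# exposing the secret itself.
secret_digest() {
  grep -E '^password_secret' "$1" | sed 's/^[^=]*= *//' | tr -d '\n' | sha256sum | awk '{print $1}'
}

# Example usage against a throwaway file:
tmp=$(mktemp)
printf 'password_secret = example-secret-value\n' > "$tmp"
secret_digest "$tmp"
rm -f "$tmp"
```

If the 64-character digests differ between any two nodes, the secrets differ, even if they look the same at a glance (trailing whitespace is a common culprit).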

Another thing, maybe not relevant to the problem: if you are running a two-node Data Node cluster, you should add at least one more node. Two-node clusters are prone to failures and split-brain issues and are not a recommended setup.

Thanks,
Tomas

Hi @Tdvorak, thanks for getting back to me. Sorry for taking so long to reply - I’ve been on holiday!

I’ll see if I can get a 3rd node into the cluster; we have a dev front/backend setup on ‘free’ Red Hat dev licences, which I could potentially break up so one of the boxes moves into the live cluster.

I’ve been through the Data Node logs and there are only a few flavours of “authentication failed” in both the opensearch and datanode logs:

2025-04-22T08:28:56.431Z INFO [OpensearchProcessImpl] [2025-04-22T08:28:56,431][WARN ][o.o.s.a.BackendRegistry ] [RNGDC1PGLS01.rng.reddenorthgate.com] Authentication finally failed for null from 10.181.144.15:43942
2025-04-22T08:28:57.074Z INFO [OpensearchProcessImpl] [2025-04-22T08:28:57,074][WARN ][o.o.s.a.BackendRegistry ] [RNGDC1PGLS01.rng.reddenorthgate.com] Authentication finally failed for null from 10.181.144.15:43942
2025-04-22T08:28:57.389Z INFO [OpensearchProcessImpl] [2025-04-22T08:28:57,389][WARN ][o.o.s.a.BackendRegistry ] [RNGDC1PGLS01.rng.reddenorthgate.com] Authentication finally failed for null from 10.181.144.15:43942

and:

{"type": "server", "timestamp": "2025-04-22T08:25:52,590Z", "level": "WARN", "component": "o.o.s.a.BackendRegistry", "cluster.name": "datanode-cluster", "node.name": "RNGDC1PGLS01.rng.reddenorthgate.com", "message": "Authentication finally failed for null from 10.181.144.15:60236", "cluster.uuid": "zET-JFeuSkOn3YMjqGYrvw", "node.id": "KEYKdfuVTpytmfAAmlA1ZQ" }
{"type": "server", "timestamp": "2025-04-22T08:25:52,770Z", "level": "WARN", "component": "o.o.s.a.BackendRegistry", "cluster.name": "datanode-cluster", "node.name": "RNGDC1PGLS01.rng.reddenorthgate.com", "message": "Authentication finally failed for null from 10.181.144.15:60236", "cluster.uuid": "zET-JFeuSkOn3YMjqGYrvw", "node.id": "KEYKdfuVTpytmfAAmlA1ZQ" }

The server.log is much more descriptive and has plenty of errors and stack traces in it. I’ve snipped a bit out here, and will work on uploading the full log once I can move it off the LAN.


2025-04-22T09:19:04.968+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS02.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #8).
2025-04-22T09:19:05.184+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:05.306+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS02.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #9).
2025-04-22T09:19:05.903+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS01.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #10).
2025-04-22T09:19:06.080+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS02.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #12).
2025-04-22T09:19:06.184+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:07.042+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS02.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #11).
2025-04-22T09:19:07.180+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:08.162+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:08.162+01:00 WARN  [IndexRotationThread] Elasticsearch cluster isn't healthy. Skipping index rotation.
2025-04-22T09:19:08.180+01:00 ERROR [ClusterAdapterOS2] Check for connectivity failed with exception 'An error occurred: ' - enable debug level for this class to see the stack trace.
2025-04-22T09:19:08.184+01:00 ERROR [VersionProbe] Unable to retrieve version from indexer node RNGDC1PGLS02.rng.reddenorthgate.com:9200: unknown error - an exception occurred while deserializing error response: {}
com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Authentication': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (okio.Buffer$inputStream$1); line: 1, column: 16]
        at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2572) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.JsonParser._constructReadException(JsonParser.java:2598) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.JsonParser._constructReadException(JsonParser.java:2606) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:765) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3659) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2747) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:867) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:753) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader._initForReading(ObjectReader.java:357) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:2115) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1501) ~[graylog.jar:?]
        at retrofit2.converter.jackson.JacksonResponseBodyConverter.convert(JacksonResponseBodyConverter.java:33) ~[graylog.jar:?]
        at retrofit2.converter.jackson.JacksonResponseBodyConverter.convert(JacksonResponseBodyConverter.java:23) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probeSingleHost$3(VersionProbe.java:149) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.rootResponse(VersionProbe.java:208) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probeSingleHost(VersionProbe.java:159) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probeAllHosts$2(VersionProbe.java:125) ~[graylog.jar:?]
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source) ~[?:?]
        at java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(Unknown Source) ~[?:?]
        at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source) ~[?:?]
        at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.evaluate(Unknown Source) ~[?:?]
        at java.base/java.util.stream.ReferencePipeline.findFirst(Unknown Source) ~[?:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probeAllHosts(VersionProbe.java:127) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probe$1(VersionProbe.java:107) ~[graylog.jar:?]
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) [graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:160) [graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probe(VersionProbe.java:107) [graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probe(VersionProbe.java:84) [graylog.jar:?]
        at org.graylog2.periodical.ESVersionCheckPeriodical.doRun(ESVersionCheckPeriodical.java:104) [graylog.jar:?]
        at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:99) [graylog.jar:?]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
        at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
        at java.base/java.lang.Thread.run(Unknown Source) [?:?]
2025-04-22T09:19:08.185+01:00 ERROR [VersionProbe] Unable to retrieve version from indexer node RNGDC1PGLS01.rng.reddenorthgate.com:9200: unknown error - an exception occurred while deserializing error response: {}
com.fasterxml.jackson.core.JsonParseException: Unrecognized token 'Authentication': was expecting (JSON String, Number, Array, Object or token 'null', 'true' or 'false')
 at [Source: (okio.Buffer$inputStream$1); line: 1, column: 16]
        at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:2572) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.JsonParser._constructReadException(JsonParser.java:2598) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.JsonParser._constructReadException(JsonParser.java:2606) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:765) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._reportInvalidToken(UTF8StreamJsonParser.java:3659) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._handleUnexpectedValue(UTF8StreamJsonParser.java:2747) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser._nextTokenNotInObject(UTF8StreamJsonParser.java:867) ~[graylog.jar:?]
        at com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:753) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader._initForReading(ObjectReader.java:357) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader._bindAndClose(ObjectReader.java:2115) ~[graylog.jar:?]
        at com.fasterxml.jackson.databind.ObjectReader.readValue(ObjectReader.java:1501) ~[graylog.jar:?]
        at retrofit2.converter.jackson.JacksonResponseBodyConverter.convert(JacksonResponseBodyConverter.java:33) ~[graylog.jar:?]
        at retrofit2.converter.jackson.JacksonResponseBodyConverter.convert(JacksonResponseBodyConverter.java:23) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probeSingleHost$3(VersionProbe.java:149) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.rootResponse(VersionProbe.java:208) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probeSingleHost(VersionProbe.java:159) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probeAllHosts$2(VersionProbe.java:125) ~[graylog.jar:?]
        at java.base/java.util.stream.ReferencePipeline$3$1.accept(Unknown Source) ~[?:?]
        at java.base/java.util.ArrayList$ArrayListSpliterator.tryAdvance(Unknown Source) ~[?:?]
        at java.base/java.util.stream.ReferencePipeline.forEachWithCancel(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.copyIntoWithCancel(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.copyInto(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(Unknown Source) ~[?:?]
        at java.base/java.util.stream.FindOps$FindOp.evaluateSequential(Unknown Source) ~[?:?]
        at java.base/java.util.stream.AbstractPipeline.evaluate(Unknown Source) ~[?:?]
        at java.base/java.util.stream.ReferencePipeline.findFirst(Unknown Source) ~[?:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probeAllHosts(VersionProbe.java:127) ~[graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.lambda$probe$1(VersionProbe.java:107) ~[graylog.jar:?]
        at com.github.rholder.retry.AttemptTimeLimiters$NoAttemptTimeLimit.call(AttemptTimeLimiters.java:78) [graylog.jar:?]
        at com.github.rholder.retry.Retryer.call(Retryer.java:160) [graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probe(VersionProbe.java:107) [graylog.jar:?]
        at org.graylog2.storage.versionprobe.VersionProbe.probe(VersionProbe.java:84) [graylog.jar:?]
        at org.graylog2.periodical.ESVersionCheckPeriodical.doRun(ESVersionCheckPeriodical.java:104) [graylog.jar:?]
        at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:99) [graylog.jar:?]
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
        at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
        at java.base/java.lang.Thread.run(Unknown Source) [?:?]
2025-04-22T09:19:08.186+01:00 INFO  [VersionProbe] Indexer is not available. Retry #3
2025-04-22T09:19:09.183+01:00 WARN  [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://RNGDC1PGLS01.rng.reddenorthgate.com:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
Authentication finally failed];, errorDetails=[]}, retrying (attempt #12).
2025-04-22T09:19:09.355+01:00 INFO  [Messages] Bulk indexing finally successful (attempt #13).
2025-04-22T09:19:11.085+01:00 INFO  [Messages] Bulk indexing finally successful (attempt #13).
2025-04-22T09:19:14.203+01:00 INFO  [Messages] Bulk indexing finally successful (attempt #13).

It’s worth noting the cluster claims to be healthy… Even though the server.log above says otherwise!

Thanks for the logs. From the server log I see that at one moment auth is failing, but the next it’s happily bulk-indexing data.

If your password_secret is configured correctly everywhere (all places using the same value), then there is one more thing that can break the auth. Could you please verify that the time is in sync on all machines? JWT relies on exact time, and while there are some buffers for token validity, generally any difference larger than a few seconds could cause some requests to fail. I’d recommend using NTP everywhere if you aren’t already.
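To make that check concrete, here is a small sketch for comparing epoch timestamps between nodes. The hostname, the ssh command, and the 5-second budget are illustrative, not Graylog’s actual tolerance:

```shell
# skew_ok LOCAL_EPOCH REMOTE_EPOCH MAX_SKEW
# Succeeds if the absolute difference between the two epoch timestamps
# is within MAX_SKEW seconds.
skew_ok() {
  diff=$(( $1 - $2 ))
  [ "${diff#-}" -le "$3" ]   # ${diff#-} strips the sign -> absolute value
}

# Example: a remote clock 45 seconds behind, checked against a 5s budget.
# On a real node you would collect the remote value over ssh, e.g.:
#   remote=$(ssh datanode-1 date +%s)
local_now=$(date +%s)
remote=$(( local_now - 45 ))
if skew_ok "$local_now" "$remote" 5; then
  echo "clocks within tolerance"
else
  echo "clock skew exceeds tolerance - check NTP/chrony"
fi
```

Running this against every Data Node from the leader quickly shows whether any clock is off by more than a couple of seconds.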

Thanks
Tomas

@Tdvorak Thank you! Time is all over the place. It’s worst on the is_leader=true graylog node.
I’ll get that fixed up and report back.

So the next person to stumble upon this thread has some more detail: the culprit was time. The primary Graylog node was correct, but the two Data Nodes and the secondary server were on a different time, 45 seconds in the past. I assume the JWT tokens got signed with a time in the future and it all got confused until the time was corrected?

The issue seems to stem from our sprawling VMware environment not having NTP set up properly, so I went through the VMs and disabled time sync, then correctly configured chrony (which was a mission in itself) to read from local domain DCs instead of firewalled internet servers. Everything now has the same time to within a second and all the auth errors have gone away.
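For anyone following along, the fix amounts to a chrony.conf along these lines; this is a sketch, and the DC hostnames are placeholders for your own domain controllers. On VMware guests, the hypervisor's periodic time sync in VMware Tools should be disabled first so the two mechanisms don't fight:

```
# /etc/chrony.conf - minimal sketch, hostnames are placeholders
server dc1.example.internal iburst
server dc2.example.internal iburst
driftfile /var/lib/chrony/drift
# Step the clock on the first few updates if it is far off (e.g. 45s),
# instead of slewing slowly
makestep 1.0 3
rtcsync
```

After a `systemctl restart chronyd`, `chronyc tracking` should report a system-time offset in the millisecond range.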

Thank you Tomas.

Happy to help! Thank you for the detailed explanation! We’re now discussing internally whether some automated checks could prevent such problems in the future.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.