"Authentication Finally Failed"

1. Describe your incident:
After running for 10 minutes, my leader Graylog node starts throwing authentication errors against my two Graylog-Datanode backend servers. There is a string of Java errors beginning with:
graylog-server Caused by: org.graylog.shaded.opensearch2.org.opensearch.client.ResponseException: method [GET], host [https://datanode-2:9200], URI [/p_f_237/_stats/store], status line [HTTP/1.1 401 Unauthorized]
and
graylog-server 2025-04-08T10:26:49.436+01:00 WARN [Messages] Caught exception during bulk indexing: ElasticsearchException{message=OpenSearchException[An error occurred: ]; nested: OpenSearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [https://Datanode-1:9200], URI [/_bulk?timeout=1m], status line [HTTP/1.1 401 Unauthorized]
2025-04-08 10:25:57.481
followed by a graylog-server "Authentication finally failed" message.

This happens every minute. The system is mostly working, though widgets and searches will occasionally return "401 Unauthorized / Authentication finally failed" until refreshed.

The problem doesn’t seem to be present on the secondary graylog web node.

2. Describe your environment:

  • OS Information: RHEL 9.5
  • Package Version: graylog-server-6.1.10 & graylog-datanode-6.1.10
  • Service logs, configurations, and environment variables:
    I can zip up some logs and post them if need be, but extended snippets are in the first reply:
    server.conf
    is_leader = true
    node_id_file = /etc/graylog/server/node-id
    password_secret = REMOVED
    root_password_sha2 = REMOVED
    bin_dir = /usr/share/graylog-server/bin
    data_dir = /opt/graylog-server
    plugin_dir = /usr/share/graylog-server/plugin
    http_bind_address = 10.181.144.15:9000
    http_publish_uri = https://graylog.domain.com:9000/
    http_enable_tls = true
    http_tls_cert_file = /etc/graylog/graylog.pem
    http_tls_key_file = /etc/graylog/graylog.key
    stream_aware_field_types=false
    disabled_retention_strategies = none,close
    allow_leading_wildcard_searches = false
    allow_highlighting = false
    field_value_suggestion_mode = on
    output_batch_size = 5000
    output_flush_interval = 1
    output_fault_count_threshold = 5
    output_fault_penalty_seconds = 30
    processor_wait_strategy = blocking
    ring_size = 65536
    inputbuffer_ring_size = 65536
    inputbuffer_wait_strategy = blocking
    message_journal_enabled = true
    message_journal_dir = /opt/graylog-server/journal
    message_journal_max_age = 24h
    message_journal_max_size = 60gb
    lb_recognition_period_seconds = 3
    mongodb_uri = mongodb://graylog_user:password@graylog1/graylog
    mongodb_max_connections = 1000
    transport_email_enabled = true
    transport_email_hostname = mailrelay.mail.com
    transport_email_port = 25
    transport_email_use_auth = false

datanode.conf
node_id_file = /etc/graylog/datanode/node-id
config_location = /etc/graylog/datanode
password_secret = REMOVED
root_password_sha2 = REMOVED
mongodb_uri = mongodb://graylog_user:password@graylog1/graylog
opensearch_location = /usr/share/graylog-datanode/dist
opensearch_config_location = /opt/graylog-datanode/opensearch/config
opensearch_data_location = /opt/graylog-datanode/opensearch/data
opensearch_logs_location = /var/log/graylog-datanode/opensearch
opensearch_heap = 24g

3. What steps have you already taken to try and solve the problem?
Checked logs, services, and configs; all of these seem to be using the correct strings, SHAs, and passwords. Nothing seems to differ between the erroring and non-erroring Graylog nodes except for “is_leader” on the leader node.

4. How can the community help?

I’m not really sure why these errors are occurring, what they mean, or how I can fix them, sorry, which is why I’m here!

Hi @mc114,
The communication between the Graylog server and the data node uses JWT authentication. The JWT tokens use password_secret and involve some caches and timeouts, but none of these match your 10 minutes; they expire sooner. Apart from that, "Authentication finally failed" would indicate a problem with JWT. Here are my suggestions to check:

  • Double-check that your password_secret is set to the very same value on every node.
  • Check the data node logs for OpenSearch stack traces that might point to the problem.
  • Is there any proxy involved anywhere in the communication?
  • Could you post the full stack trace of the error, so I can see which part of the code base is triggering the request?
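
For the first check, here is a minimal sketch of one way to compare password_secret across nodes without pasting the secret anywhere. It prints a short SHA-256 fingerprint of the value found in each config file, so you can run it on every node and compare the output by eye. The function name secret_fingerprint and the file paths in the usage note are assumptions for illustration, and the parsing only handles plain key = value lines like the ones in your snippets.

```python
import hashlib
import re
import sys
from typing import Optional


def secret_fingerprint(conf_text: str, key: str = "password_secret") -> Optional[str]:
    """Return a short SHA-256 fingerprint of the value assigned to `key`.

    Returns None if the key is missing or only appears in a comment.
    Hypothetical helper, not part of Graylog.
    """
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip commented-out settings
        match = re.match(rf"{re.escape(key)}\s*=\s*(.*)$", stripped)
        if match:
            value = match.group(1).strip()
            # Only a fingerprint is returned, never the secret itself.
            return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    return None


if __name__ == "__main__":
    # Pass one or more config file paths on the command line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as fh:
            print(path, secret_fingerprint(fh.read()))
```

Run it on each Graylog and data node against that node's own config file (for example /etc/graylog/server/server.conf and /etc/graylog/datanode/datanode.conf, assuming the default package locations). If any fingerprint differs, including because of a trailing character or a different quoting style, the secrets are not identical.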

Another thing, maybe not relevant to the problem: if you are running a two-node data node cluster, you should add at least one more node. Two-node clusters are prone to failures and split-brain issues and are not a recommended setup.
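
To illustrate why two nodes are fragile, here is a small sketch of simple majority-quorum arithmetic. It assumes a plain "more than half of the nodes must agree" model; the real OpenSearch cluster manages its own voting configuration, so treat this only as a rough picture. The function names are made up for the example.

```python
def quorum(nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return nodes - quorum(nodes)


if __name__ == "__main__":
    for n in (2, 3):
        print(f"{n} nodes: quorum={quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Under this model, a two-node cluster needs both nodes to form a majority, so it cannot lose either one, while a three-node cluster can lose one node and keep going.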

Thanks,
Tomas