Elasticsearch cluster datanode-cluster is red. Shards: 1 unassigned

1. Describe your incident:
I’m using Graylog Data Node, and the server stopped saving new logs. Graylog reports:
Elasticsearch cluster datanode-cluster is red. Shards: 40 active, 0 initializing, 0 relocating, 1 unassigned.
I didn’t make any changes to the cluster, and there’s plenty of disk space:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3        94G   29G   61G  33% /

and there’s plenty of RAM:

              total        used        free      shared  buff/cache   available
Mem:           5.8G        3.8G      488.0M      588.0K        1.5G        1.8G
Swap:          4.0G           0        4.0G

2. Describe your environment:

  • OS Information: Alpine Linux 3.20 running as a virtual machine on Proxmox VE (with daily backups using PBS)

  • Package Version: MongoDB 5.0, Graylog Enterprise 6.0, Graylog Datanode 6.0

  • Service logs, configurations, and environment variables:
    running on Docker Compose

version: "3.8"

services:
  mongodb:
    hostname: "mongodb"
    image: "mongo:5.0"
    volumes:
      - "mongodb_data:/data/db"
    restart: "no"
    networks:
      - graylog_network

  datanode:
    image: "${DATANODE_IMAGE:-graylog/graylog-datanode:6.0}"
    depends_on:
      mongodb:
        condition: "service_started"
    hostname: "datanode"
    env_file:
        - stack.env
    environment:
      GRAYLOG_DATANODE_NODE_ID_FILE: "/var/lib/graylog-datanode/node-id"
      GRAYLOG_DATANODE_MONGODB_URI: "mongodb://mongodb:27017/graylog"
    ulimits:
      memlock:
        hard: -1
        soft: -1
      nofile:
        soft: 65536
        hard: 65536
    ports:
      - "8999:8999/tcp"   # DataNode API
      - "9200:9200/tcp"
      - "9300:9300/tcp"
    volumes:
      - "graylog-datanode:/var/lib/graylog-datanode"
      - "/opt/geodb:/opt/geodb"
    restart: "no"
    networks:
      - graylog_network

  graylog:
    hostname: "server"
    image: "${GRAYLOG_IMAGE:-graylog/graylog-enterprise:6.0}"
    depends_on:
      mongodb:
        condition: "service_started"
    entrypoint: "/usr/bin/tini --  /docker-entrypoint.sh"
    env_file:
        - stack.env
    environment:
      GRAYLOG_NODE_ID_FILE: "/usr/share/graylog/data/data/node-id"
      GRAYLOG_HTTP_BIND_ADDRESS: "0.0.0.0:9000"
      GRAYLOG_HTTP_EXTERNAL_URI: "http://localhost:9000/"
      GRAYLOG_MONGODB_URI: "mongodb://mongodb:27017/graylog"
      # To make reporting (headless_shell) work inside a Docker container
      GRAYLOG_REPORT_DISABLE_SANDBOX: "true"
    ports:
    - "5140:5140/udp"   # OPNsense logs
    - "5142:5142/udp"   # Linux logs
    - "9000:9000/tcp"   # Server API
    volumes:
      - "graylog_data:/usr/share/graylog/data/data"
      - "graylog_journal:/usr/share/graylog/data/journal"
      - "/opt/geodb:/opt/geodb"
    restart: "no"
    networks:
      - graylog_network

volumes:
  mongodb_data:
  graylog-datanode:
  graylog_data:
  graylog_journal:
  
networks:
  graylog_network:
    driver: bridge

3. What steps have you already taken to try and solve the problem?
I generated a client certificate to connect to OpenSearch/Elasticsearch with curl (following this document: Graylog Data Node - Getting Started), but I get “Authentication finally failed”.
I also restarted the virtual machine multiple times.

4. How can the community help?
Provide instructions on how to resolve this issue.

Hi @mlazzarotto, can you share the curl command you used that returned “Authentication finally failed”?

I believe this means the curl request didn’t connect using the issued client certificate.

Hi @drewmiranda-gl,
this is the command that I used:
curl -v "https://localhost:9200/_cluster/health?pretty" -k --cert datanode_certificate --key datanode_private
The datanode_certificate and datanode_private are valid files that I’ve obtained from the Graylog web interface.
When I run that command, the datanode docker container logs show this message:
2024-08-08T18:32:00.145Z INFO [OpensearchProcessImpl] [2024-08-08T18:32:00,143][WARN ][o.o.s.a.BackendRegistry ] [datanode] Authentication finally failed for null from 172.19.0.1:50992

I just ran through a test to see if I can reproduce this:

  • Used the docker compose file you provided (with minor edits such as adding back the missing envvars and using a different env file)
  • Ran through the first run screens for datanode
  • Generated a client certificate
  • Used the copy buttons to save the 3 certs to files (though I don’t need the CA cert)
  • Executed the curl command exactly as you listed, but with different file names:
    • curl -v "https://localhost:9200/_cluster/health?pretty" -k --cert cert.crt --key cert.key

Doing so successfully returns:

{
  "cluster_name" : "datanode-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 11,
  "active_shards" : 11,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
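
When comparing several responses like the one above, it can help to reduce each one to the two fields that matter in this thread: status and unassigned_shards. The following is a minimal sketch, assuming the pretty-printed layout shown above (one field per line); the function name health_summary is made up for illustration and is not part of Graylog or OpenSearch.

```shell
# Hypothetical helper: read a pretty-printed _cluster/health response on
# stdin and print "<status> <unassigned_shards>" on one line.
health_summary() {
  awk -F'[:,]' '
    # The leading quote in the pattern keeps "delayed_unassigned_shards"
    # from matching the unassigned_shards rule.
    /"status"/            { gsub(/[ "]/, "", $2); status = $2 }
    /"unassigned_shards"/ { gsub(/[ "]/, "", $2); unassigned = $2 }
    END { print status, unassigned }
  '
}

# Usage, reusing the curl command from this post:
#   curl -sk --cert cert.crt --key cert.key \
#     "https://localhost:9200/_cluster/health?pretty" | health_summary
```

For the healthy response above, this prints "green 0".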

What response are you getting back from the curl command? Does it return an HTTP code?

Are you able to verify the cert files using (borrowed from here):

openssl x509 -noout -modulus -in cert.crt | openssl md5 > /tmp/crt.pub
openssl rsa -noout -modulus -in cert.key | openssl md5 > /tmp/key.pub
diff /tmp/crt.pub /tmp/key.pub
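
The same check can be written without temporary files. This is only a sketch, assuming an RSA key pair (as the -modulus commands above already do) and an openssl binary on the PATH; cert_key_match is a hypothetical helper name. It compares the moduli directly and reports an openssl failure separately, so that two failed reads are not mistaken for a match.

```shell
# Hypothetical helper: succeed (exit 0) if the certificate and the private
# key share the same RSA modulus, exit 1 if they differ, and exit 2 if
# openssl could not read one of the files.
cert_key_match() {
  cert_mod=$(openssl x509 -noout -modulus -in "$1") || return 2
  key_mod=$(openssl rsa -noout -modulus -in "$2") || return 2
  [ -n "$cert_mod" ] && [ "$cert_mod" = "$key_mod" ]
}

# Usage:
#   cert_key_match cert.crt cert.key && echo "certificate and key match"
```

If the private key is protected by a passphrase, openssl will prompt for it before printing the modulus.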

Thanks!

Hi Drew,

What response are you getting back from the curl command? Does it return an HTTP code?

“Authentication finally failed”
“401 Unauthorized”
This is the full verbose output:

* Host logservernew.lab.mydomain.it:9200 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:9200...
* Connected to logservernew.lab.mydomain.it (::1) port 9200
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Request CERT (13):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Certificate (11):
* TLSv1.3 (OUT), TLS handshake, CERT verify (15):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / x25519 / RSASSA-PSS
* ALPN: server did not agree on a protocol. Uses default.
* Server certificate:
*  subject: CN=datanode
*  start date: Jul 17 21:22:26 2024 GMT
*  expire date: Jul 17 21:22:26 2025 GMT
*  issuer: CN=Graylog CA
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* using HTTP/1.x
> GET /_cluster/health?pretty HTTP/1.1
> Host: logservernew.lab.mydomain.it:9200
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
< HTTP/1.1 401 Unauthorized
< content-type: text/plain; charset=UTF-8
< content-length: 29
<
* Connection #0 to host logservernew.lab.mydomain.it left intact
Authentication finally failed

Are you able to verify the cert files using (borrowed from here):

No, the command openssl rsa -noout -modulus -in cert.key | openssl md5 > /tmp/key.pub returns this error: “Could not find private key from certificate.txt”. That makes sense, because the certificate should not contain the private key.
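
The error message names certificate.txt, which suggests the openssl rsa step was pointed at the certificate file rather than at the private key file. One way to confirm which saved file holds which item is to list the PEM block types inside each file. A minimal sketch, assuming the files were saved as PEM text from the web interface; pem_kinds is a hypothetical helper name, and the file names in the usage lines are the ones mentioned in this thread.

```shell
# Hypothetical helper: print the type of every PEM block in a file,
# one per line (for example CERTIFICATE or PRIVATE KEY).
pem_kinds() {
  # Strip carriage returns first, in case the copied text was saved
  # with Windows line endings.
  tr -d '\r' < "$1" | sed -n 's/^-----BEGIN \(.*\)-----$/\1/p'
}

# Usage:
#   pem_kinds certificate.txt     # expected to list CERTIFICATE
#   pem_kinds datanode_private    # expected to list some kind of PRIVATE KEY
```

The file that reports a private key block is the one to pass to openssl rsa -in and to curl --key; the one that reports CERTIFICATE goes to openssl x509 -in and curl --cert.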
