Yet another "flood stage disk watermark [95%] exceeded"

Hi Graylog community,

I am struggling with the following error:
[graylog-datanode] flood stage disk watermark [95%] exceeded on [VzTn69eQTOqG5WoMHOiI9Q][graylog-datanode][/var/lib/graylog-datanode/opensearch/data/nodes/0] free: 18.2mb[0.1%], all indices on this node will be marked read-only

I know what it is trying to tell me. But I cannot grasp WHY the datanode is telling me this. I am running the Graylog stack as Docker containers. More details are down below. For the moment, let’s stay with the error/symptoms:

It says: free: 18.2mb[0.1%]

When I run “df -h .” within the mentioned directory, it says:

root@graylog-datanode:/var/lib/graylog-datanode/opensearch/data/nodes/0# df -h .
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/pve-vm--131--disk--0   16G  7.6G  7.3G  52% /var/lib/graylog-datanode

I would say that is plenty of free space, in both absolute and relative terms.

Furthermore: when I completely reset the environment, there is roughly ~500 MB more space on the device. The containers then start up perfectly, I can go through the setup process of connecting the datanode from within Graylog, and it is usable for a while. It works like a charm in this time window. And then, all of a sudden, the indices get locked because of the above error.
How can I troubleshoot that any further?
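
The only idea I have so far is to watch the space from inside the container while it fills up, roughly like this (just a sketch; “graylog-datanode” is the container name from my compose file, and it assumes df/du are available in the image):

# poll the data directory every 60 s to see how the numbers change over time
while true; do
  docker exec graylog-datanode df -h /var/lib/graylog-datanode
  docker exec graylog-datanode du -sh /var/lib/graylog-datanode/opensearch/data
  sleep 60
done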

My system environment:

LXC Container (Proxmox 8.2.8)
Docker version 27.3.1
Graylog 6.1
Graylog Datanode 6.1

Hello @stev-io,

There are a couple of layers here with the LXC container and Docker, and I wonder if the issue rests somewhere in there. I’ve seen this issue crop up here previously, and it is usually related to Docker or LXC or both. There is no resolution yet.


Hello @stev-io,
Could you please check and tell us what the /_nodes/stats/fs endpoint returns?

The easiest way is to use client certificates and the following command:

curl -XGET  -k  https://your-datanode:9200/_nodes/stats/fs --key client-cert.key --cert client-cert.crt --cacert ca.cert
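
If it helps, the disk watermark thresholds that are actually in effect can be checked the same way (just a suggestion; adjust the host and certificate paths to your setup):

curl -XGET -k "https://your-datanode:9200/_cluster/settings?include_defaults=true&flat_settings=true&pretty" --key client-cert.key --cert client-cert.crt --cacert ca.cert | grep watermark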

Thanks, Tomas


Hi, thanks for responding! I’ll check. But it seems like I need to reset the environment first, because the Graylog server now refuses UI access (I guess as a result of the datanode refusing connections… as a result of no space left… (?)).

Okay, I reset the complete environment; here is my attempt:

First, to confirm that the connection is working, a simple request with its response:

graylog:~$ curl "https://localhost:9200/_cluster/health?pretty" -k --cert client-cert.crt --key client-cert.key
{
  "cluster_name" : "datanode-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 6,
  "active_shards" : 6,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

And executing your request:

graylog:~$ curl "https://localhost:9200/_nodes/stats/fs?pretty" -k --cert client-cert.crt --key client-cert.key
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Values less than -1 bytes are not supported: -2199552b"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Values less than -1 bytes are not supported: -2199552b",
    "suppressed" : [
      {
        "type" : "illegal_state_exception",
        "reason" : "Failed to close the XContentBuilder",
        "caused_by" : {
          "type" : "i_o_exception",
          "reason" : "Unclosed object or array found"
        }
      }
    ]
  },
  "status" : 400
}

Ok, it seems it was still in an intermediate state and not really operational yet when I first tried to send the request.

Now it is reporting something:

graylog:~$ curl "https://localhost:9200/_nodes/stats/fs?pretty" -k --cert client-cert.crt --key client-cert.key
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "datanode-cluster",
  "nodes" : {
    "P6E3FqbJSWK7Ffa-6LR-Vg" : {
      "timestamp" : 1732213852711,
      "name" : "graylog-datanode",
      "transport_address" : "172.18.0.5:9300",
      "host" : "graylog-datanode",
      "ip" : "172.18.0.5:9300",
      "roles" : [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client",
        "search"
      ],
      "attributes" : {
        "shard_indexing_pressure_enabled" : "true"
      },
      "fs" : {
        "timestamp" : 1732213852712,
        "total" : {
          "total_in_bytes" : 16729894912,
          "free_in_bytes" : 9444036608,
          "available_in_bytes" : 151842816,
          "cache_reserved_in_bytes" : 8416423936,
          "cache_utilized" : 0
        },
        "data" : [
          {
            "path" : "/var/lib/graylog-datanode/opensearch/data/nodes/0",
            "mount" : "/var/lib/graylog-datanode (/dev/mapper/pve-vm--131--disk--0)",
            "type" : "ext4",
            "total_in_bytes" : 16729894912,
            "free_in_bytes" : 9444036608,
            "available_in_bytes" : 151842816,
            "cache_reserved_in_bytes" : 8416423936,
            "cache_utilized" : 0
          }
        ],
        "io_stats" : {
          "devices" : [
            {
              "device_name" : "dm-21",
              "operations" : 126909,
              "read_operations" : 35107,
              "write_operations" : 91802,
              "read_kilobytes" : 536828,
              "write_kilobytes" : 410552,
              "read_time" : 19417,
              "write_time" : 126030,
              "queue_size" : 145447,
              "io_time_in_millis" : 32914
            }
          ],
          "total" : {
            "operations" : 126909,
            "read_operations" : 35107,
            "write_operations" : 91802,
            "read_kilobytes" : 536828,
            "write_kilobytes" : 410552,
            "read_time" : 19417,
            "write_time" : 126030,
            "queue_size" : 145447,
            "io_time_in_millis" : 32914
          }
        }
      }
    }
  }
}
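
As a side note, the filesystem totals can also be pulled out of that response directly with jq (assuming jq is installed on the host):

graylog:~$ curl -s "https://localhost:9200/_nodes/stats/fs" -k --cert client-cert.crt --key client-cert.key | jq '.nodes[].fs.total'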

And in the meantime it also reached the flood watermark:

2024-11-21T18:33:54.616Z INFO  [OpensearchProcessImpl] [2024-11-21T18:33:54,616][WARN ][o.o.c.r.a.DiskThresholdMonitor] [graylog-datanode] flood stage disk watermark [95%] exceeded on [P6E3FqbJSWK7Ffa-6LR-Vg][graylog-datanode][/var/lib/graylog-datanode/opensearch/data/nodes/0] free: 144.6mb[0.9%], all indices on this node will be marked read-only

So that means the reported available_in_bytes value is the issue here: it is only ~145 MB even though free_in_bytes shows over 9 GB, and that is exactly the 0.9% that Graylog is complaining about.
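
One more note for anyone following along: if the indices stay locked even after space has been freed up, my understanding is that the read-only block can be cleared manually with the same client certificates, roughly like this:

graylog:~$ curl -XPUT "https://localhost:9200/_all/_settings?pretty" -k --cert client-cert.crt --key client-cert.key -H 'Content-Type: application/json' -d '{ "index.blocks.read_only_allow_delete": null }'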