Yet another "flood stage disk watermark [95%] exceeded"

Hi Graylog community,

I am struggling with the following error:
[graylog-datanode] flood stage disk watermark [95%] exceeded on [VzTn69eQTOqG5WoMHOiI9Q][graylog-datanode][/var/lib/graylog-datanode/opensearch/data/nodes/0] free: 18.2mb[0.1%], all indices on this node will be marked read-only

I know what it is trying to tell me, but I cannot grasp WHY the datanode is telling me this. I am running the Graylog stack as Docker containers; more details below. For the moment, let’s stay with the error/symptoms:

It says: free: 18.2mb[0.1%]

When I run “df -h .” within the mentioned directory, it says:

root@graylog-datanode:/var/lib/graylog-datanode/opensearch/data/nodes/0# df -h .
Filesystem                        Size  Used Avail Use% Mounted on
/dev/mapper/pve-vm--131--disk--0   16G  7.6G  7.3G  52% /var/lib/graylog-datanode

I would say that is plenty of free space, both in absolute and relative terms.

Furthermore: when I completely reset the environment, there is about ~500 MB more space on the device. The containers then start up perfectly, I can go through the setup process of connecting the datanode from within Graylog, and it is usable for a while. It works like a charm in that time window. Then, all of a sudden, the indices get locked because of the above error.
How can I troubleshoot that any further?

My system environment:

LXC Container (Proxmox 8.2.8)
Docker version 27.3.1
Graylog 6.1
Graylog Datanode 6.1

Hello @stev-io,

There are a couple of layers here with the LXC container and Docker, and I wonder if the issue lies somewhere in that stack. I’ve seen this issue crop up here before, usually involving Docker or LXC or both, but there hasn’t been a resolution yet.

Hello @stev-io,
Could you please check and tell us what the /_nodes/stats/fs endpoint returns?

The easiest way is to use client certificates and the following command:

curl -XGET -k https://your-datanode:9200/_nodes/stats/fs --key client-cert.key --cert client-cert.crt --cacert ca.cert

Thanks, Tomas

Hi, thanks for responding! I’ll check. But it seems like I need to reset the environment, because the Graylog server now refuses UI access (I guess as a result of the datanode refusing connections… as a result of no space left… (?)).

Okay, I reset the complete environment; here is my attempt:

First, a simple request (with its response) to confirm that the connection is working:

graylog:~$ curl "https://localhost:9200/_cluster/health?pretty" -k --cert client-cert.crt --key client-cert.key
{
  "cluster_name" : "datanode-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "discovered_master" : true,
  "discovered_cluster_manager" : true,
  "active_primary_shards" : 6,
  "active_shards" : 6,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

And executing your request:

graylog:~$ curl "https://localhost:9200/_nodes/stats/fs?pretty" -k --cert client-cert.crt --key client-cert.key
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "Values less than -1 bytes are not supported: -2199552b"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "Values less than -1 bytes are not supported: -2199552b",
    "suppressed" : [
      {
        "type" : "illegal_state_exception",
        "reason" : "Failed to close the XContentBuilder",
        "caused_by" : {
          "type" : "i_o_exception",
          "reason" : "Unclosed object or array found"
        }
      }
    ]
  },
  "status" : 400
}

OK, it seems the node was still in an intermediate state and not fully operational when I first sent the request.

Now it is reporting something:

graylog:~$ curl "https://localhost:9200/_nodes/stats/fs?pretty" -k --cert client-cert.crt --key client-cert.key
{
  "_nodes" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "cluster_name" : "datanode-cluster",
  "nodes" : {
    "P6E3FqbJSWK7Ffa-6LR-Vg" : {
      "timestamp" : 1732213852711,
      "name" : "graylog-datanode",
      "transport_address" : "172.18.0.5:9300",
      "host" : "graylog-datanode",
      "ip" : "172.18.0.5:9300",
      "roles" : [
        "cluster_manager",
        "data",
        "ingest",
        "remote_cluster_client",
        "search"
      ],
      "attributes" : {
        "shard_indexing_pressure_enabled" : "true"
      },
      "fs" : {
        "timestamp" : 1732213852712,
        "total" : {
          "total_in_bytes" : 16729894912,
          "free_in_bytes" : 9444036608,
          "available_in_bytes" : 151842816,
          "cache_reserved_in_bytes" : 8416423936,
          "cache_utilized" : 0
        },
        "data" : [
          {
            "path" : "/var/lib/graylog-datanode/opensearch/data/nodes/0",
            "mount" : "/var/lib/graylog-datanode (/dev/mapper/pve-vm--131--disk--0)",
            "type" : "ext4",
            "total_in_bytes" : 16729894912,
            "free_in_bytes" : 9444036608,
            "available_in_bytes" : 151842816,
            "cache_reserved_in_bytes" : 8416423936,
            "cache_utilized" : 0
          }
        ],
        "io_stats" : {
          "devices" : [
            {
              "device_name" : "dm-21",
              "operations" : 126909,
              "read_operations" : 35107,
              "write_operations" : 91802,
              "read_kilobytes" : 536828,
              "write_kilobytes" : 410552,
              "read_time" : 19417,
              "write_time" : 126030,
              "queue_size" : 145447,
              "io_time_in_millis" : 32914
            }
          ],
          "total" : {
            "operations" : 126909,
            "read_operations" : 35107,
            "write_operations" : 91802,
            "read_kilobytes" : 536828,
            "write_kilobytes" : 410552,
            "read_time" : 19417,
            "write_time" : 126030,
            "queue_size" : 145447,
            "io_time_in_millis" : 32914
          }
        }
      }
    }
  }
}

And in the meantime it also hit the flood watermark again:

2024-11-21T18:33:54.616Z INFO  [OpensearchProcessImpl] [2024-11-21T18:33:54,616][WARN ][o.o.c.r.a.DiskThresholdMonitor] [graylog-datanode] flood stage disk watermark [95%] exceeded on [P6E3FqbJSWK7Ffa-6LR-Vg][graylog-datanode][/var/lib/graylog-datanode/opensearch/data/nodes/0] free: 144.6mb[0.9%], all indices on this node will be marked read-only

So that means that the reported available bytes are the issue here … They are the 0.9% that Graylog is complaining about.

From the documentation I see that available_in_bytes is determined by Java’s getUsableSpace(), which I took to mean the remaining space of the JVM heap.

So I guess I need to adjust the max heap size to solve this problem? But why would the flood watermark be based on the JVM heap size and not on the drive space?

The difference between free_in_bytes and available_in_bytes could be caused by permissions or other limitations applied to the JVM or the user that is running the datanode.

Could it be that your filesystem has some space reserved for the root user, which appears as free (to the df command) but is not usable by the datanode JVM/user?
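
One way to check this (just a sketch, assuming ext4 and the device name from your df output) is to look at the reserved block count that tune2fs reports:

tune2fs -l /dev/mapper/pve-vm--131--disk--0 | grep -Ei 'block count|block size'

By default ext4 reserves 5% of the blocks for root, which on a ~16 GB volume is roughly 0.8 GB that df still shows as free but a non-root process cannot use.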

Hi @Tdvorak thanks again for stepping by :innocent:

Just to be sure we’re talking about the same thing: from what I read here, I thought the JVM heap size (which I indeed limit via GRAYLOG_DATANODE_OPENSEARCH_HEAP) was the problem. But on the other hand that makes no sense, since the heap is not disk-space related.

So you think the JVM may be subject to another limit that is indeed disk-space related? That would make sense. Unfortunately I am not a storage expert and need to find out what else the reason might be.
Regarding your suggestions: do you have any hints on how I could check that?

BTW, this is the JVM-related state from the node:

"jvm" : {
        "pid" : 76,
        "version" : "21.0.3",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "21.0.3+9-LTS",
        "vm_vendor" : "Eclipse Adoptium",
        "bundled_jdk" : true,
        "using_bundled_jdk" : true,
        "start_time_in_millis" : 1732210171166,
        "mem" : {
          "heap_init_in_bytes" : 3670016000,
          "heap_max_in_bytes" : 3670016000,
          "non_heap_init_in_bytes" : 7667712,
          "non_heap_max_in_bytes" : 0,
          "direct_max_in_bytes" : 0
        },
        "gc_collectors" : [
          "G1 Young Generation",
          "G1 Concurrent GC",
          "G1 Old Generation"
        ],
        "memory_pools" : [
          "CodeHeap 'non-nmethods'",
          "Metaspace",
          "CodeHeap 'profiled nmethods'",
          "Compressed Class Space",
          "G1 Eden Space",
          "G1 Old Gen",
          "G1 Survivor Space",
          "CodeHeap 'non-profiled nmethods'"
        ],
        "using_compressed_ordinary_object_pointers" : "true",
        "input_arguments" : [
          "-Xshare:auto",
          "-Dopensearch.networkaddress.cache.ttl=60",
          "-Dopensearch.networkaddress.cache.negative.ttl=10",
          "-XX:+AlwaysPreTouch",
          "-Xss1m",
          "-Djava.awt.headless=true",
          "-Dfile.encoding=UTF-8",
          "-Djna.nosys=true",
          "-XX:-OmitStackTraceInFastThrow",
          "-XX:+ShowCodeDetailsInExceptionMessages",
          "-Dio.netty.noUnsafe=true",
          "-Dio.netty.noKeySetOptimization=true",
          "-Dio.netty.recycler.maxCapacityPerThread=0",
          "-Dio.netty.allocator.numDirectArenas=0",
          "-Dlog4j.shutdownHookEnabled=false",
          "-Dlog4j2.disable.jmx=true",
          "-Djava.security.manager=allow",
          "-Djava.locale.providers=SPI,COMPAT",
          "-Xms1g",
          "-Xmx1g",
          "-XX:+UseG1GC",
          "-XX:G1ReservePercent=25",
          "-XX:InitiatingHeapOccupancyPercent=30",
          "-Djava.io.tmpdir=/tmp/opensearch-18366132746827840855",
          "-XX:+HeapDumpOnOutOfMemoryError",
          "-XX:HeapDumpPath=data",
          "-XX:ErrorFile=/tmp/hs_err_pid%p.log",
          "-Xlog:gc*,gc+age=trace,safepoint:file=/tmp/gc.log:utctime,pid,tags:filecount=32,filesize=64m",
          "-Djava.security.manager=allow",
          "-Xms3500M",
          "-Xmx3500M",
          "-Djavax.net.ssl.trustStore=/var/lib/graylog-datanode/opensearch/config/opensearch/datanode-truststore.p12",
          "-Djavax.net.ssl.trustStorePassword=G...kC",
          "-Djavax.net.ssl.trustStoreType=pkcs12",
          "-XX:MaxDirectMemorySize=1835008000",
          "-Dopensearch.path.home=/usr/share/graylog-datanode/dist/opensearch-2.15.0-linux-x64",
          "-Dopensearch.path.conf=/var/lib/graylog-datanode/opensearch/config/opensearch",
          "-Dopensearch.distribution.type=tar",
          "-Dopensearch.bundled_jdk=true"
        ]
      }

From what I can see, the search role applied to the data node seems to be the issue. You have 16GB of disk space, and based on the metrics you have used about 6.5GB, with 9.5GB left. However, when the search role is applied, the node tries to reserve 10GB for the search node cache by default, making the system report that it is out of space even though that cache is not actually being used.
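
As a rough sanity check against the fs stats above: free_in_bytes is about 9.4 GB, but the ~8.4 GB cache_reserved_in_bytes gets subtracted from it (together, presumably, with the filesystem’s reserved blocks mentioned earlier), leaving only the ~150 MB available_in_bytes that the watermark monitor actually looks at.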

Just to extend what @kpearson already recommended, here’s the configuration property for the datanode. Try to set this to, let’s say 1G:

node_search_cache_size=1G

This configuration controls the following OpenSearch setting, which the datanode sets to 10G by default: https://opensearch.org/docs/latest/tuning-your-cluster/availability-and-recovery/snapshots/searchable_snapshot/#configuring-a-node-to-use-searchable-snapshots
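
Since you run the whole stack in Docker, the same property can presumably also be set through the datanode’s environment-variable mapping (the same GRAYLOG_DATANODE_ prefix pattern as the GRAYLOG_DATANODE_OPENSEARCH_HEAP variable you already use), e.g.:

GRAYLOG_DATANODE_NODE_SEARCH_CACHE_SIZE=1G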

Try to restart the datanode and check if the available space increased and the warnings are gone.

Hey ho,

thank you for that one, I’ll check. I saw that with the default config the actual cache size was about 8GB, presumably because it could not grow to the full 10GB.

Unfortunately, whenever the flood watermark is hit, a restart results in the datanode crashing; it won’t start up anymore. The error logged before the startup attempt fails refers to the flood watermark. Is there a way to work around that?

For now I have reset the environment and will see how it behaves.

THANKS

Cache is now at 2GB:

https://localhost:9200/_nodes/stats/file_cache?pretty
"file_cache" : {
      "timestamp" : 1732285874992,
      "active_in_bytes" : 0,
      "total_in_bytes" : 2147483648,
      "used_in_bytes" : 0,
      "evictions_in_bytes" : 0,
      "active_percent" : 0,
      "used_percent" : 0,
      "hit_count" : 0,
      "miss_count" : 0
    }

And the filesystem reports:

"fs" : {
        "timestamp" : 1732286045182,
        "total" : {
          "total_in_bytes" : 16729894912,
          "free_in_bytes" : 9284390912,
          "available_in_bytes" : 6261137408,
          "cache_reserved_in_bytes" : 2147483648,
          "cache_utilized" : 0
        }

So now there is about 37% usable space left. Seems like that was the solution :smiley:

THANK you very much!

Maybe one of you could have a look at my final question in the reply above, but that’s not crucial anymore!

Awesome! You are right, recovery after hitting the flood watermark causes problems during startup, and there is no simple way to remove the block during that time. We are working on it.
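
Once the datanode is up again, the block can usually be cleared manually via the standard OpenSearch settings API (not datanode-specific, but it is the usual way to lift the read-only flag), using the same client certificates as above:

curl -XPUT "https://localhost:9200/_all/_settings" -k --cert client-cert.crt --key client-cert.key -H 'Content-Type: application/json' -d '{"index.blocks.read_only_allow_delete": null}'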
