Elasticsearch (OpenSearch) node disk usage above flood stage watermark


1. Describe your incident:
My Graylog instance is throwing the error shown below.

2. Describe your environment:

  • OS Information:
    Almalinux 9.5

  • Package Version:
    graylog-6.1-repository.noarch 1-1 @System
    graylog-datanode.x86_64 6.1.2-1 @graylog
    graylog-server.x86_64 6.1.2-1 @graylog

  • Service logs, configurations, and environment variables:

3. What steps have you already taken to try and solve the problem?
Googled the hell out of it with no luck. I deployed Graylog from the official Red Hat Installation guide, which uses the Data Node based on OpenSearch and not Elasticsearch, so the help links within Graylog do not help at all. I also looked on the OpenSearch site for answers, but none of the curl commands work. I cannot get any basic info from the OpenSearch/Graylog system. I only know that my storage is not full:

Filesystem                                    Size  Used Avail Use% Mounted on
proc                                             0     0     0    - /proc
sysfs                                            0     0     0    - /sys
devtmpfs                                      4.0M     0  4.0M   0% /dev
securityfs                                       0     0     0    - /sys/kernel/security
tmpfs                                         3.8G     0  3.8G   0% /dev/shm
devpts                                           0     0     0    - /dev/pts
tmpfs                                         1.6G  8.6M  1.5G   1% /run
cgroup2                                          0     0     0    - /sys/fs/cgroup
pstore                                           0     0     0    - /sys/fs/pstore
bpf                                              0     0     0    - /sys/fs/bpf
/dev/mapper/almalinux_almalinuxtemplate-root   28G   17G   11G  61% /
systemd-1                                        -     -     -    - /proc/sys/fs/binfmt_misc
hugetlbfs                                        0     0     0    - /dev/hugepages
mqueue                                           0     0     0    - /dev/mqueue
tracefs                                          0     0     0    - /sys/kernel/tracing
debugfs                                          0     0     0    - /sys/kernel/debug
fusectl                                          0     0     0    - /sys/fs/fuse/connections
configfs                                         0     0     0    - /sys/kernel/config
none                                             0     0     0    - /run/credentials/systemd-sysctl.service
none                                             0     0     0    - /run/credentials/systemd-tmpfiles-setup-dev.service
/dev/sda1                                     960M  353M  608M  37% /boot
none                                             0     0     0    - /run/credentials/systemd-tmpfiles-setup.service
binfmt_misc                                      0     0     0    - /proc/sys/fs/binfmt_misc
tmpfs                                         769M     0  769M   0% /run/user/0

From /var/log/graylog-datanode/opensearch/datanode-cluster.log:

[2024-12-04T09:05:50,826][WARN ][o.o.c.r.a.DiskThresholdMonitor] [localhost] flood stage disk watermark [95%] exceeded on [GT6uLGlyTeqPPd0krcBBUA][localhost][/var/lib/graylog-datanode/opensearch/data/nodes/0] free: 992.7mb[3.4%], all indices on this node will be marked read-only

[2024-12-04T09:05:50,826][WARN ][o.o.c.r.a.DiskThresholdMonitor] [localhost] Putting index create block on cluster as all nodes are breaching high disk watermark. Number of nodes above high watermark: 1.
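For context, once the flood stage watermark is breached, indices are marked read-only. Recent OpenSearch versions lift that block automatically when disk usage drops back below the high watermark, but if it lingers it can be cleared manually. A hedged sketch — the endpoint, port, and credentials below are assumptions, and the Data Node secures its OpenSearch with its own certificates and auth, so adjust accordingly:

```shell
# The flood-stage watermark sets index.blocks.read_only_allow_delete on the
# affected indices; setting it to null removes the block again.
SETTINGS='{"index.blocks.read_only_allow_delete": null}'

# Host, port, and credentials are placeholders -- adapt them to your
# Data Node's OpenSearch endpoint. The trailing '|| true' only keeps this
# sketch from aborting when no cluster is reachable.
curl -k -u admin:admin -X PUT "https://localhost:9200/_all/_settings" \
     -H 'Content-Type: application/json' -d "$SETTINGS" || true
```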

4. How can the community help?

How can I troubleshoot this issue, and how/why does it report having no free space when the disks are not full yet?

Hi @ruzjio,
The problem you are encountering is the OpenSearch snapshot search cache.

The Data Node configures this cache to 10 GB by default. The cache then reserves most of your free space, and you see the watermark warnings.
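As a rough sanity check, the numbers line up with the df output above (sizes in GiB; the 10 GiB cache size is the assumed default):

```shell
# Back-of-the-envelope arithmetic: OpenSearch subtracts the reserved
# snapshot search cache from the space it considers available.
total=28   # root filesystem size (GiB, from df)
used=17    # space already used (GiB, from df)
cache=10   # default node_search_cache_size reservation (GiB)

free=$((total - used - cache))
echo "OpenSearch sees roughly ${free} GiB free"
# ...which is close to the "free: 992.7mb[3.4%]" in the watermark warning.
```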

If you aren’t using data tiering in Graylog, you can reduce the cache size with the following setting in your Data Node configuration file:

node_search_cache_size=1gb

(or any other supported unit).

Best regards,
Tomas


Thanks for the quick reply.
I was reading through the links and have a few additional questions.
I’m using data tiering, as this is the default setting in current Graylog when creating indices; the legacy method is marked as deprecated.

So in reality the log space is limited by the cache, which by default is only 10 GB? That is essentially the hot tier, since that is the only tier available in the CE version, right?

When you talk about the Data Node config, are you referring to /etc/graylog/datanode/datanode.conf? If yes, I assume I should just add the setting at the end of the file, as I don’t see it configured anywhere in there.

Should I increase the size of the cache if I want to stay with data tiering enabled by default (and have more space for logs), or switch to legacy? What are the downsides of legacy?

This is just a cache for optimizing query performance of searches in a snapshot repository (S3 or filesystem). These queries may be quite slow, so OpenSearch uses the cache to keep some information readily available. More cache leads to faster queries.

If you are using the open-source version and don’t have a license, there will be no warm tier available and you won’t be using searchable snapshots, so you don’t need the cache.

The hot tier is a regular OpenSearch index, managed for you by the Data Node. It does not use this cache, and the space for your logs is limited only by the available space on your disk.

There is a fresh 6.1.4 release from yesterday that disables the cache automatically if it detects that you aren’t using searchable snapshots. You can update your installation and the problem will disappear, leaving more free space on your disk for your logs.

The difference between data tiering and legacy rotation & retention is in how you configure the disposal of old data (and the warm tier, which needs a license). With your limited space, you need a mechanism that deletes logs regularly. Both options will work for you.

Finally, /etc/graylog/datanode/datanode.conf is the correct location for changes of the datanode configuration.
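For reference, a minimal sketch of how the change might look at the end of that file (the 1gb value is just the example from above; restart the graylog-datanode service afterwards so it takes effect):

```
# /etc/graylog/datanode/datanode.conf
# Shrink the snapshot search cache (any supported size unit works)
node_search_cache_size=1gb
```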

Best regards,
Tomas

