Graylog DataNode Wrapper holding 300GB+ of deleted OpenSearch segments (File Descriptor Leak?)

Description of the Issue: We are running a Graylog DataNode setup and facing a recurring critical disk space issue. The disk fills up with “ghost” files marked as (deleted) by the OS that never release their disk space.

This accumulation is unbounded and continues for days. We have seen the usage grow to over 1TB, forcing us to constantly restart the service just to keep the server operational.

We have isolated the process holding these files. Notably, it is not the main OpenSearch engine process; it is the Graylog DataNode Wrapper/Controller process.

System Details:

  • Component: Graylog DataNode

  • JVM Settings (Wrapper): -Xms1g -Xmx1g -XX:+UseG1GC

  • Retention Strategy: Time-based (Min 30 days, Max 35 days).

The Symptoms:

  1. Retention jobs run successfully according to Graylog (indices disappear from the web interface).

  2. The OS unlinks the files (they are not visible in ls).

  3. However, disk space is not freed and usage climbs steadily, reaching 1TB+ if left unchecked.

  4. Checking /proc/[PID]/maps confirms that the DataNode Wrapper process is holding onto huge deleted OpenSearch index files (e.g., .dvd, .fdt, .tim segments).

Evidence & Troubleshooting Performed:

  • Process Identification:

    • PID 142547 (OpenSearch): 18GB Heap. (NOT holding the files).

    • PID 142008 (Graylog DataNode Wrapper): 1GB Heap. (IS holding the deleted files).

  • Output from our investigation script: (Snapshot showing ~150GB, though this grows indefinitely)

    Calculating deleted file sizes for PID: 142008 ()...
    ==================================================================
    SIZE (Raw)   | SIZE (Hum)   | FILE
    ------------------------------------------------------------------
    122880       | 120.00 KB    | /var/lib/graylog-datanode/opensearch/config/native_libs/jna/jna6532890393412097211.tmp
    11656069120  | 10.86 GB     | /var/lib/graylog-datanode/opensearch/data/nodes/0/indices/UWucT_7vTJaQzLrUORAQZw/0/index/_1ug.fdt
    11571429376  | 10.78 GB     | /var/lib/graylog-datanode/opensearch/data/nodes/0/indices/_PGlBYz8QAefU_2yen160Q/0/index/_221.fdt
    11162910720  | 10.40 GB     | /var/lib/graylog-datanode/opensearch/data/nodes/0/indices/B_Lp4CQBSI22kDZJUD8_Hg/0/index/_204.fdt
    ... (hundreds of similar files) ...
    ==================================================================
    TOTAL        | 149.40 GB    | Total Reclaimable Space
    
    
  • Manual GC Attempt: We ran jcmd 142008 GC.run multiple times. The Full GC runs (verified via jstat), but disk space is NOT released. This suggests a strong reference leak, not just “lazy” G1GC behavior.

  • JNA Issues: We also see jna...tmp (deleted) files being held by this same wrapper process, which might indicate a native library issue.

  • Restart Behavior: Restarting graylog-datanode immediately frees the space, but the accumulation starts again with the next retention cycle.
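For anyone who wants to reproduce the per-PID totals above, here is a minimal Python sketch of this kind of check. It is not our original investigation script. It assumes the deleted files show up as memory mappings in /proc/[PID]/maps, and it sums the mapped address ranges, which only approximates the real file size (a partially mapped file is under-counted, and a file mapped twice is over-counted). The function names are hypothetical.

```python
DELETED_SUFFIX = " (deleted)"


def deleted_mappings(maps_text):
    """Return {path: mapped_bytes} for mappings whose backing file was unlinked."""
    totals = {}
    for line in maps_text.splitlines():
        # Line format: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5]
        if not path.endswith(DELETED_SUFFIX):
            continue  # backing file still has a directory entry
        start, end = (int(x, 16) for x in fields[0].split("-"))
        path = path[: -len(DELETED_SUFFIX)]
        totals[path] = totals.get(path, 0) + (end - start)
    return totals


def human(n):
    """Format a byte count with binary units, two decimals."""
    for unit in ("B", "KB", "MB", "GB"):
        if n < 1024:
            return f"{n:.2f} {unit}"
        n /= 1024
    return f"{n:.2f} TB"


def main(pid):
    """Print deleted-but-mapped files for one process, smallest first."""
    with open(f"/proc/{pid}/maps") as f:
        totals = deleted_mappings(f.read())
    for path, size in sorted(totals.items(), key=lambda kv: kv[1]):
        print(f"{size:<12} | {human(size):<12} | {path}")
    print(f"TOTAL        | {human(sum(totals.values()))}")
```

Calling main(142008) as root (or as the user that owns the process) prints one line per deleted file plus a total. Handles that are open but not mapped would not appear here; those would have to be looked up under /proc/[PID]/fd instead.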

Our Analysis: It appears the DataNode wrapper is opening file handles to read/monitor OpenSearch indices (perhaps for metrics or health checks) but fails to close these handles when OpenSearch rotates/deletes the underlying files. Because the wrapper still holds a reference to each unlinked file (an open descriptor or a memory mapping), the kernel cannot free the blocks.
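The kernel behavior this analysis relies on can be shown in isolation. The sketch below is a generic Linux/POSIX illustration and says nothing about what the DataNode code actually does: it unlinks a temporary file while a descriptor is still open and shows that the data stays readable until the descriptor is closed. The function name is hypothetical.

```python
import os
import tempfile


def demo_unlinked_file_stays_alive():
    """Unlink a file while it is open and report what the kernel still keeps."""
    fd, path = tempfile.mkstemp()
    os.write(fd, b"x" * 4096)
    os.unlink(path)                  # remove the directory entry, like a deleted segment
    visible = os.path.exists(path)   # no longer visible to ls
    st = os.fstat(fd)                # inode is still alive through the open descriptor
    os.lseek(fd, 0, os.SEEK_SET)
    data = os.read(fd, 4096)         # contents are still readable
    os.close(fd)                     # only after this can the kernel free the blocks
    return visible, st.st_nlink, st.st_size, len(data)
```

On Linux this returns (False, 0, 4096, 4096): the file is gone from the directory and has a link count of zero, yet its blocks remain allocated for as long as the descriptor is held.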

Questions:

  1. Why would the DataNode Wrapper maintain open file handles to OpenSearch index segments?

  2. Since GC.run fails to clear them, is there a known defect in the DataNode monitoring thread or JNA interaction that causes these handles to become “zombie” resources?

  3. Is there a workaround to force the wrapper to release these handles without a full service restart?

Any insights or recommended configuration changes would be appreciated.

Hello, while waiting for my account to be activated/approved in the Graylog Community, I posted the details on the Graylog GitHub page and included additional information and debugging outputs.

They confirmed they were able to pinpoint the root cause and plan to include the fix in the next bugfix release.
Graylog 7 - migrated datanode does not free up disk space · Issue #23870 · Graylog2/graylog2-server

Thanks,
Iman