Description of the Issue: We are running a Graylog DataNode setup and facing a recurring critical disk space issue. The disk fills up with “ghost” files marked as (deleted) by the OS that never release their disk space.
This accumulation is unbounded and continues for days. We have seen the usage grow to over 1TB, forcing us to constantly restart the service just to keep the server operational.
We have isolated the process holding these files. Notably, it is not the main OpenSearch engine process but the Graylog DataNode Wrapper/Controller process.
System Details:
- Component: Graylog DataNode
- JVM Settings (Wrapper): `-Xms1g -Xmx1g -XX:+UseG1GC`
- Retention Strategy: Time-based (Min 30 days, Max 35 days).
The Symptoms:
- Retention jobs run successfully according to Graylog (indices disappear from the web interface).
- The OS unlinks the files (they are not visible in `ls`).
- However, disk space is not freed and usage climbs steadily, reaching 1TB+ if left unchecked.
- Checking `/proc/[PID]/maps` confirms that the DataNode Wrapper process is holding onto huge deleted OpenSearch index files (e.g., `.dvd`, `.fdt`, `.tim` segments).
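For anyone who wants to reproduce the `/proc/[PID]/maps` check, here is a minimal Python sketch. It is not our original investigation script. The function names `deleted_mappings` and `human` are made up for illustration, and it sums mapped address ranges, which can differ from the on-disk file size if a file is only partially mapped.

```python
import re
from collections import defaultdict

# One /proc/<pid>/maps line: address-range perms offset dev inode pathname
_MAPS_LINE = re.compile(
    r"^([0-9a-f]+)-([0-9a-f]+)\s+\S+\s+[0-9a-f]+\s+\S+\s+(\d+)\s+(.*)$"
)
_SUFFIX = " (deleted)"


def deleted_mappings(maps_text):
    """Return {path: mapped_bytes} for file mappings marked '(deleted)'."""
    totals = defaultdict(int)
    for line in maps_text.splitlines():
        match = _MAPS_LINE.match(line.strip())
        if not match:
            continue  # anonymous mapping or unexpected line format
        start, end, inode, path = match.groups()
        if inode == "0" or not path.endswith(_SUFFIX):
            continue  # not backed by a file, or the file still exists
        # A file may be mapped in several chunks, so sum every range.
        totals[path[: -len(_SUFFIX)]] += int(end, 16) - int(start, 16)
    return dict(totals)


def human(num_bytes):
    """Format a byte count with binary units, e.g. 122880 -> '120.00 KB'."""
    value = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if value < 1024 or unit == "TB":
            return f"{value:.2f} {unit}"
        value /= 1024
```

To use it, read `/proc/<pid>/maps` as root (or as the service user), pass the text to `deleted_mappings`, and print each entry with `human`.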
Evidence & Troubleshooting Performed:
- Process Identification:
  - PID 142547 (OpenSearch): 18GB Heap. (NOT holding the files).
  - PID 142008 (Graylog DataNode Wrapper): 1GB Heap. (IS holding the deleted files).
- Output from our investigation script (snapshot showing ~150GB, though this grows indefinitely):

  ```plaintext
  Calculating deleted file sizes for PID: 142008 ()...
  ==================================================================
  SIZE (Raw)   | SIZE (Hum) | FILE
  ------------------------------------------------------------------
  122880       | 120.00 KB  | /var/lib/graylog-datanode/opensearch/config/native_libs/jna/jna6532890393412097211.tmp
  11656069120  | 10.86 GB   | /var/lib/graylog-datanode/opensearch/data/nodes/0/indices/UWucT_7vTJaQzLrUORAQZw/0/index/_1ug.fdt
  11571429376  | 10.78 GB   | /var/lib/graylog-datanode/opensearch/data/nodes/0/indices/_PGlBYz8QAefU_2yen160Q/0/index/_221.fdt
  11162910720  | 10.40 GB   | /var/lib/graylog-datanode/opensearch/data/nodes/0/indices/B_Lp4CQBSI22kDZJUD8_Hg/0/index/_204.fdt
  ... (hundreds of similar files) ...
  ==================================================================
  TOTAL        | 149.40 GB  | Total Reclaimable Space
  ```
- Manual GC Attempt: We ran `jcmd 142008 GC.run` multiple times. The Full GC runs (verified via `jstat`), but disk space is NOT released. This suggests a strong reference leak, not just “lazy” G1GC behavior.
- JNA Issues: We also see `jna...tmp (deleted)` files being held by this same wrapper process, which might indicate a native library issue.
- Restart Behavior: Restarting `graylog-datanode` immediately frees the space, but the accumulation starts again with the next retention cycle.
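`/proc/[PID]/maps` only lists memory mappings, so a complementary check is to look at the open descriptors under `/proc/[PID]/fd` as well. The sketch below is one way to do that on Linux. The function name `deleted_open_files` is hypothetical, and reading another process's `fd` directory normally requires root or the same user.

```python
import os

_SUFFIX = " (deleted)"


def deleted_open_files(pid="self"):
    """Return {path: size_bytes} for unlinked files a process still has open.

    Linux-only: relies on the /proc/<pid>/fd symlinks, whose targets carry
    a ' (deleted)' suffix once the file has been unlinked.
    """
    fd_dir = f"/proc/{pid}/fd"
    held = {}
    for name in os.listdir(fd_dir):
        link = os.path.join(fd_dir, name)
        try:
            target = os.readlink(link)
            if target.endswith(_SUFFIX):
                # stat() through the link reports the still-allocated size.
                held[target[: -len(_SUFFIX)]] = os.stat(link).st_size
        except OSError:
            continue  # descriptor closed while we were iterating
    return held
```

If this returns nothing for the wrapper PID while `maps` still shows the deleted segments, the space would be pinned by memory mappings rather than by plain open descriptors. That distinction may help narrow down where the leak is.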
Our Analysis: It appears the DataNode wrapper is opening file handles to read/monitor OpenSearch indices (perhaps for metrics or health checks) but failing to close these handles when OpenSearch rotates/deletes the underlying files. Because the wrapper still holds references to the unlinked files, the kernel cannot free their blocks.
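The underlying mechanism is standard POSIX behavior and can be shown in isolation. The following sketch (the function name `unlink_while_open` is made up for illustration) unlinks a temporary file while a descriptor is still open. The directory entry disappears and the link count drops to zero, but the data stays allocated until the last reference is closed. The same applies to memory mappings, which is what `/proc/[PID]/maps` reports.

```python
import os
import tempfile


def unlink_while_open(size=64 * 1024):
    """Unlink a file that is still open and report what the kernel keeps."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * size)
        os.unlink(path)  # removes the name, not the data
        st = os.fstat(fd)
        return {
            "visible_in_directory": os.path.exists(path),
            "link_count": st.st_nlink,
            "bytes_still_held": st.st_size,
        }
    finally:
        # Only after the last reference is gone can the blocks be reclaimed.
        os.close(fd)
```

This would also explain why restarting `graylog-datanode` frees the space immediately: when the process exits, all of its descriptors and mappings are dropped at once.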
Questions:
- Why would the DataNode Wrapper maintain open file handles to OpenSearch index segments?
- Since `GC.run` fails to clear them, is there a known defect in the DataNode monitoring thread or JNA interaction that causes these handles to become “zombie” resources?
- Is there a workaround to force the wrapper to release these handles without a full service restart?
Any insights or recommended configuration changes would be appreciated.