Disk space issue, can't start

1. Describe your incident:
I am exploring Graylog, and I intentionally set my disk space low, planning to break the system and then fix it as a real-world exercise. Better to do this now, while I am just starting out, than in production with lots of data. I had the system up and running: HTTPS enabled, ingesting logs from many sources. Then the disk filled up and the web UI stopped responding. Only port 8999 is listening now; 9200, 443, and the others are not. Note: this is a single-server install.

2. Describe your environment:

  • OS Information:
    RHEL 9.5 (actually Oracle Linux)

  • Package Version:
    6.1.4-2

  • Service logs, configurations, and environment variables:

3. What steps have you already taken to try and solve the problem?
I added a new disk, then partitioned, formatted, and mounted it. I moved the files (preserving permissions) from /var/lib/graylog-datanode/opensearch to /mnt/datanode/opensearch and edited datanode.conf and opensearch.yml to reflect the new locations. I rebooted for good measure. Port 9200 now opens for a brief second and closes again, and the server service will not start without the indexer. I presume the datanode won't start because its indices are in read-only mode. If that is true, the only documentation I have found for removing the read-only "flag" requires sending a request to port 9200.
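Roughly the steps I took, as a sketch. The option names opensearch_data_location and path.data are my best reading of which settings to edit; double-check them against your own config files:

# stop the datanode before touching its data directory
sudo systemctl stop graylog-datanode

# move the data, preserving ownership and permissions
sudo rsync -a /var/lib/graylog-datanode/opensearch/ /mnt/datanode/opensearch/

# in /etc/graylog/datanode/datanode.conf, point at the new location:
#   opensearch_data_location = /mnt/datanode

# in opensearch.yml, the corresponding setting on the OpenSearch side:
#   path.data: /mnt/datanode/opensearch

sudo systemctl start graylog-datanode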

4. How can the community help?
I'm concerned that a full disk could require a complete system rebuild. So far I have not succeeded in restarting the datanode. Some suggestions I have found are incomplete, like running "vanilla opensearch" in order to remove the read-only block. What does that mean? How? I have tried executing opensearch directly, but it errors out. I am willing to re-initialize and start over, but I don't see instructions for that either. I put myself in this situation intentionally, since I can see it happening for real, and it is worrying that I have not yet been able to fix it. Advice would be helpful, or perhaps some assurance that paid support could fix this. The risk of losing TBs of data because of a full disk is a bit concerning.

Thanks,
Bill


There are a lot of warnings and errors in the logs related to invalid SSL certificates, but I think they are a red herring. I will try to replace/update the certificates anyway.

There is one specific log entry that concerns me, because the fix appears to be impossible to apply:

WARN [OpensearchProcessImpl] ClusterBlockException[index [.opensearch-observability] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];]

The problem is that, as far as I can tell, the fix requires sending a configuration change over port 9200. However, this very issue blocks the service from starting, so port 9200 never comes up.
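For reference, this is the standard OpenSearch settings call I would run if port 9200 were reachable. The plain-HTTP localhost form below is an assumption; the datanode's HTTPS and auth setup would need the appropriate flags:

# the documented fix: clear the read-only-allow-delete block on all
# indices -- but it needs a live node on 9200
curl -X PUT "http://localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'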

Hi @wbeavis,
Thanks for your bug report. The problem here is the opensearch-observability plugin.

We recently removed it, but that change has not been backported yet. I assume it will be part of the next bugfix release.

Meanwhile, you can try to remove the plugin manually.

Your opensearch distribution, which is used by the datanode, should be located in /usr/share/graylog-datanode/dist/opensearch-2.15.0-linux-x64

There, you can run the following command:

sudo bin/opensearch-plugin remove opensearch-observability

Without this plugin, OpenSearch won't try to access .opensearch-observability during startup, which should be enough for a clean start. That will then give you access to OpenSearch on port 9200.
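Once the node is listening again, a quick sketch of the follow-up, assuming a self-signed certificate (hence -k) and that your security setup allows local requests; add credentials if yours requires them:

# confirm the node answers on 9200
curl -k "https://localhost:9200/_cluster/health?pretty"

# list any indices that still carry a block, then clear it with the
# settings call quoted earlier in the thread (after freeing disk space)
curl -k "https://localhost:9200/_all/_settings?pretty&filter_path=**.blocks"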

We are also working on a user-friendly way to remove index blocks during startup.

Thank you. That helped me move forward. Unfortunately, I seem to have broken or reset something related to certificates: the datanode believed it was in its preliminary state and asked for certificates or for the Graylog preflight interface to be run. Without knowing specifically how to fix that, I moved /etc/graylog/server to /etc/graylog/server.old and ran yum reinstall graylog-server to get back to an initial setup state. I then went through the initial steps of setting the passwords in the .conf files and so on. On startup, that was enough to bring the interface back. I still need to fix HTTPS and probably a few other items, but it shows old data and is collecting new data, so I seem to be back up and running.
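For anyone hitting the same wall, roughly the commands I used; the server.conf options I re-set are the standard password_secret and root_password_sha2 settings, but your steps may differ:

# set the broken config aside and reinstall to get fresh defaults
sudo mv /etc/graylog/server /etc/graylog/server.old
sudo yum reinstall graylog-server

# then re-populate /etc/graylog/server/server.conf (password_secret,
# root_password_sha2, etc.) and restart
sudo systemctl restart graylog-server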

Thanks again. I feel the plugin issue was the main thing blocking my progress.

Thank you for the feedback. Good to hear that the plugin removal helped.

Datanode certificates should not be related to that plugin in any way. The datanode key and certificate are stored encrypted in the datanode.jks file in the datanode configuration directory on your local file system. If you configured a self-signed CA during the preflight, that CA is stored in MongoDB and shared by both the datanode and the Graylog server.
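If you want to see what the keystore contains, something like this works; the path is an assumption (check your datanode configuration directory), and the store password is the one from your datanode config:

# list the entries in the datanode keystore (prompts for the password)
sudo keytool -list -keystore /etc/graylog/datanode/datanode.jks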

If you mean the SSL certificates for the Graylog server and the UI itself, those are configured differently and are not related to the datanode, so the plugin should not break them.

If you discover anything else that causes trouble, or anything that helps, I'd like to hear about it so we can make the process as smooth as possible.

Thanks and best regards,
Tomas
