Graylog api/system timeout failures

Hello,

Since today around 17:21, the Graylog server logs have been flooded with messages like:

2019-07-10T19:41:37.975+01:00 WARN  [ProxiedResource] Unable to call http://gray.log.ip.address:9000/api/system on node <node-id>
java.net.SocketTimeoutException: timeout.

I can call the API from the command line and get a response:

$ curl -i 'http://gray.log.ip.address:9000/api/system/metrics/' -u user:pass

but this hangs:

$ curl -i 'http://gray.log.ip.address:9000/api/system/' -u user:pass

The web interface is noticeably slower, and I cannot access the node API browser.

A little earlier I noticed indexer failures, but those were solved by setting "index.blocks.read_only_allow_delete": null. I'm not sure if this could be related.

Does anyone have any insight regarding this issue?

Thanks in advance.

Did you check the available storage on your Elasticsearch nodes? Elasticsearch sets the indices to read-only if there is not enough free disk space …
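For anyone following along, disk usage can be checked both on the host and through Elasticsearch's cat API. A quick sketch; the data path and the localhost:9200 address are assumptions, adjust them to your setup:

```shell
# Check free space on the filesystem holding the Elasticsearch data
# (default data path assumed -- adjust to your installation)
df -h /var/lib/elasticsearch

# Ask Elasticsearch how much disk each node sees
# (assumes Elasticsearch is listening on localhost:9200)
curl -s 'http://localhost:9200/_cat/allocation?v'
```

By default, Elasticsearch applies the read-only-allow-delete block once a node crosses the flood-stage disk watermark (95% used).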

Earlier that day, disk usage reached 95%; this happened because of a planned reorganization of the LVMs on the disk. I then grew the root filesystem, bringing usage down to 47%.

This led to Elasticsearch making the indices read-only, which I believe I fixed with:

"index.blocks.read_only_allow_delete": null
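For reference, the setting above can be applied to all indices through the Elasticsearch settings API. A minimal sketch, assuming Elasticsearch listens on localhost:9200:

```shell
# Clear the read-only-allow-delete block on all indices.
# Host and port are assumptions; adjust to your cluster.
curl -X PUT 'http://localhost:9200/_all/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'
```

Note that if the disk fills up past the flood-stage watermark again, Elasticsearch will re-apply the block, so the underlying disk pressure has to be resolved as well.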

However, around the same time I made this change, the API timeouts began.

So it's not a disk-space issue.

Development:

After logging in to the Graylog web interface and requesting:

http://my.graylog.url:9000/api/system

I am prompted for authentication. Could this be the source of the timeout? Is something not properly authenticated?

Meanwhile, I have narrowed the timed-out API calls down to:

  • api/system
  • api/system/inputstate
  • api/system/jobs
  • api/system/metrics/multiple
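A quick way to see which endpoints hang is to time each one with curl's write-out variables. A sketch; the URL and credentials below are placeholders:

```shell
# Time each suspect endpoint; -m caps the wait at 10 seconds so a hung
# call fails fast instead of blocking the whole loop.
for ep in system system/inputstate system/jobs system/metrics/multiple; do
  printf '%-30s ' "$ep"
  curl -s -o /dev/null -m 10 -u user:pass \
       -w 'HTTP %{http_code} in %{time_total}s\n' \
       "http://my.graylog.url:9000/api/$ep"
done
```

Endpoints that answer quickly print a sub-second time_total, while the problematic ones run into the 10-second cap.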

To anyone hitting the same problem in the future: it could be your DNS.

High DNS resolution times severely impacted Graylog's performance.

The DNS trouble started at the same time I was doing everything described above, so I thought it had to be related to something I had done wrong.

I fixed it with a simple entry in /etc/hosts matching the interface’s IP address.
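For anyone applying the same workaround, the entry looks like this (the hostname and address are placeholders; use your server's actual FQDN and interface IP):

```
# /etc/hosts -- map the server's own hostname to its interface IP
# so Graylog's inter-node API calls skip the slow DNS lookup
192.0.2.10   graylog.example.com graylog
```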