Yesterday I had some minor issues in my Graylog cluster when users started to report that there had been no logs since 3 pm. I started to investigate but couldn't find anything interesting in the logs, so I restarted both Graylog and ES and found there were 7M+ unprocessed messages. This morning processing had finished and everything looks fine, but when I check the node stats I see the message “Couldn’t get journal information”. Other than that, everything looks fine: ES is GREEN and logs are coming in at a rate of about 2K.

My version is

This cluster has two nodes (Graylog+ES, and ES only), each with 8 GB RAM, 4 cores and 1 TB of not-too-fast storage.

As I wrote earlier, I have a sustained ingress rate of 2K to 4K, and the total size of the indexes is half a terabyte.
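As a rough illustration of what these figures mean for sizing, the arithmetic below assumes that "2K to 4K" means messages per second and reuses the 7M+ backlog mentioned above. `messages_per_day` and `backlog_minutes` are hypothetical helpers for this sketch, not anything provided by Graylog.

```python
# Back-of-envelope arithmetic for the figures quoted in this thread.
# ASSUMPTION: "2K to 4K" is a sustained rate in messages per second.

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def messages_per_day(rate_per_sec: int) -> int:
    """Total messages ingested in one day at a constant rate."""
    return rate_per_sec * SECONDS_PER_DAY


def backlog_minutes(backlog: int, rate_per_sec: int) -> float:
    """Minutes of input a backlog represents at the given ingest rate,
    i.e. how long it takes to pile up if nothing is processed."""
    return backlog / rate_per_sec / 60


if __name__ == "__main__":
    for rate in (2_000, 4_000):
        print(f"{rate} msg/s -> {messages_per_day(rate):,} msg/day")
    # The 7M+ unprocessed messages, measured at the low end of the ingest rate.
    print(f"7,000,000 msgs at 2,000 msg/s ~ {backlog_minutes(7_000_000, 2_000):.0f} min of input")
```

At these rates the cluster takes in roughly 170M to 350M messages per day, and a 7M backlog corresponds to about an hour of input at 2K msg/s.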

So, my question is two-fold:

  • how can I fix this journal metrics issue? (I have already increased the journal size a bit)
  • does my cluster have enough resources to serve this traffic? Is there a sizing guide?

Thank you

Hey @vladx,

This is a known issue. It will be fixed in 3.2.3.

Your monitoring system should check this, not you. Monitor your environment.
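A minimal sketch of such a check, in Python. It assumes that each Graylog node exposes its journal state at `/api/system/journal` with fields named `uncommitted_journal_entries`, `journal_size` and `journal_size_limit`; verify the path and field names in the API browser of your Graylog version. The thresholds are hypothetical. A payload with no journal data is treated as a problem, which corresponds to the UI showing “Couldn’t get journal information”.

```python
import base64
import json
import urllib.request

# ASSUMPTION: endpoint path and field names are taken to match Graylog's
# REST API; check them against your version before relying on this.
JOURNAL_PATH = "/api/system/journal"


def fetch_journal(base_url: str, user: str, password: str) -> dict:
    """GET the journal state from the node at base_url (e.g. http://host:9000)."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        base_url.rstrip("/") + JOURNAL_PATH,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def check_journal(info: dict, max_uncommitted: int = 100_000, max_fill: float = 0.8) -> list:
    """Return a list of problems found in a journal payload (empty list = OK).

    max_uncommitted and max_fill are hypothetical alert thresholds.
    """
    uncommitted = info.get("uncommitted_journal_entries")
    if uncommitted is None:
        # No journal data at all: nothing else can be evaluated.
        return ["journal information missing"]
    problems = []
    if uncommitted > max_uncommitted:
        problems.append(f"{uncommitted} uncommitted entries")
    size = info.get("journal_size")
    limit = info.get("journal_size_limit")
    if size is not None and limit:
        fill = size / limit
        if fill > max_fill:
            problems.append(f"journal {fill:.0%} full")
    return problems
```

A monitoring agent could run `check_journal(fetch_journal("http://graylog.example:9000", "admin", "secret"))` (hypothetical host and credentials) on a schedule and alert whenever the returned list is non-empty. Keeping the evaluation separate from the HTTP call makes the thresholds easy to test without a live node.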

I think this is something else.
  • There is no reverse proxy; I access Graylog on port 9000 via a non-localhost address.
  • Earlier (before the upgrade and/or the issue) it worked, and I was able to check journal utilisation from the Web UI.

Any other ideas?


Hey @vladx,

Sorry that I did not point directly to the PR that fixes the issue. The issue linked above was the “starter” for identifying the problem:

The above is the fix.

