Yesterday I had some minor issues in my Graylog cluster when users started to report that there had been no logs since 3pm. I started to investigate but couldn't find anything interesting in the logs, so I restarted both Graylog and ES and found there were 7M+ unprocessed messages. This morning processing has finished and everything looks fine, but when I check the node stats I see the message "Couldn't get journal information". Other than that everything looks fine: ES is GREEN and logs are coming in at a ~2K msg/s rate.
My version is
graylog-server-3.2.2-1.noarch
This cluster has two nodes (Graylog+ES and ES only), each with 8 GB RAM, 4 cores, and 1 TB of not-too-fast storage.
As I wrote earlier, I have a sustained ingress rate of 2K to 4K messages per second, and the total size of the indices is about half a terabyte.
So, my questions are two-fold:
how can I fix this journal metrics issue? (I already increased the journal size a bit; see the server.conf sketch below)
does my cluster have enough resources to serve this traffic? Is there any sizing guide?
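
For reference, these are the journal-related settings I touched in server.conf. The path and values below are just an illustrative sketch of my change, not a recommendation; the defaults may differ on your install, and graylog-server needs a restart for them to take effect.

```
# /etc/graylog/server/server.conf -- journal settings (example values)
message_journal_enabled = true
# Journal directory; must be writable by the graylog user and have enough free disk
message_journal_dir = /var/lib/graylog-server/journal
# Default is 5gb; increased so a few hours of backlog at 2-4K msg/s fits on disk
message_journal_max_size = 20gb
# Messages older than this may be dropped from the journal even if unprocessed
message_journal_max_age = 12h
```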
I think this is something else:
- there is no reverse proxy; I am accessing Graylog on port 9000 using a non-localhost address
- earlier (before the upgrade and/or the issue) it was OK, so I was able to check journal utilisation from the Web UI
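
In case it helps with the journal metrics question: as far as I understand, the node page gets that figure from the REST API, so I can try querying the journal endpoint directly. The hostname and credentials below are placeholders, and the /api/system/journal path is my assumption based on the API browser, so treat this as a sketch:

```
# Ask a node for its journal status directly over the REST API
# (replace admin:password and graylog-node with real values; 9000 is the default HTTP port)
curl -s -u admin:password -H 'Accept: application/json' \
  http://graylog-node:9000/api/system/journal
```

If that call works from localhost but not from the address the browser uses, that would point at the HTTP URI settings rather than the journal itself.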