This just started happening yesterday afternoon: my Graylog server, which had been running almost flawlessly for over a year, has started freaking out. As a first attempt at a fix, we lowered the log levels across our servers so that only error-level messages and higher are sent. That measurably cut our overall input rate from hundreds of messages per second down to double digits, but the server is still struggling.
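(For context, the level change on the senders looks roughly like the rsyslog rule below; our actual senders vary, and the hostname and port here are placeholders rather than our real input:)

```
# /etc/rsyslog.d/50-graylog.conf -- forward only err severity and above
# "graylog.example.com:514" is a placeholder for the real Graylog syslog input
*.err @@graylog.example.com:514
```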
Next we increased the virtual hardware by a tier, so the VM now has 32 GB of RAM and 8 CPUs, and I've raised the Graylog (GL) and Elasticsearch (ES) heap sizes to take advantage of the extra memory. Performance is somewhat better, but it's still struggling: the journal is starting to grow, and the process buffer is hovering around 60-65% utilization. The only people logged in are admins, and none of us are running queries.
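For reference, the heap changes were along these lines (paths assume the DEB/RPM packages, and the sizes are illustrative rather than a recommendation; since both JVMs share the box, I left the rest of the 32 GB for the OS page cache, which ES leans on heavily when indexing):

```
# /etc/default/graylog-server (or /etc/sysconfig/graylog-server on RHEL):
# Graylog's heap is set via -Xms/-Xmx; other default JVM flags left as shipped
GRAYLOG_SERVER_JAVA_OPTS="-Xms8g -Xmx8g"

# /etc/elasticsearch/jvm.options (ES 5.x+; older ES versions use ES_HEAP_SIZE
# in /etc/default/elasticsearch instead):
-Xms12g
-Xmx12g
```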
The most obvious symptom is that the messages In/Out meter drops to 0 messages out and stays there for a minute or more at a time, yet the ES process doesn't seem to be using many resources other than disk writes.
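In case it helps, these are the checks I've been running while the output is stalled. I'm assuming the journal endpoint below is the right place to read the journal numbers from the REST API, and the indexing thread pool is named `bulk` on our ES version (it's `write` on newer releases):

```
# Graylog journal state -- uncommitted entries climbing means output can't keep up
curl -s -u admin http://localhost:9000/api/system/journal

# ES cluster health -- red/yellow status or relocating shards would explain stalls
curl -s 'http://localhost:9200/_cluster/health?pretty'

# ES indexing thread pool -- non-zero "rejected" means ES itself is the bottleneck
curl -s 'http://localhost:9200/_cat/thread_pool/bulk?v&h=name,active,queue,rejected'
```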
ES is still running on the same VM as Graylog, and we're actively working to move it to the AWS Elasticsearch service, but I want to make sure this isn't caused by something else.
Thank you in advance!