Our system is a single node handling about 1500 msg/s. It has been stable for the past 35 days, since we created it. However, 4 days ago the process buffer filled up, as did the disk journal. When this happens, output rates drop to around 20-100 msg/s and stay that way until the server is rebooted. Once rebooted, the system outputs around 4000-15000 msg/s until the journal is empty, then holds steady until it breaks again.
While the system was stable for the past 35 or so days, we now see the process buffer fill up almost every 24 hours. This is despite no apparent changes in logging in our environment and no changes to the server itself.
If we shut off all inputs, let the buffer catch up until it is empty, and then start the inputs again, output continues at the low rate (20-100 msg/s). The input buffer fills up almost immediately, and then the disk journal fills up.
We’ve tried restarting the Graylog and Elasticsearch services to see whether that would increase the output rate. However, so far nothing except a full reboot gets it back up to speed.
We have never been able to catch this as it happens, so some of our information about when it occurs is based on the alerts in Graylog. The last time it failed, we got the following alerts:
Nodes with too long GC pauses (triggered 18 hours ago)
There are Graylog nodes on which the garbage collector runs too long. Garbage collection runs should be as short as possible. Please check whether those nodes are healthy. (Node: 602a0297-afdf-49ce-83aa-7b5b141aee1d , GC duration: 1379 ms , GC threshold: 1000 ms )
Journal utilization is too high (triggered 15 hours ago)
Journal utilization is too high and may go over the limit soon. Please verify that your Elasticsearch cluster is healthy and fast enough. You may also want to review your Graylog journal settings and set a higher limit. (Node: 602a0297-afdf-49ce-83aa-7b5b141aee1d )
Uncommited messages deleted from journal (triggered 15 hours ago)
Some messages were deleted from the Graylog journal before they could be written to Elasticsearch. Please verify that your Elasticsearch cluster is healthy and fast enough. You may also want to review your Graylog journal settings and set a higher limit. (Node: 602a0297-afdf-49ce-83aa-7b5b141aee1d )
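Since the slowdown has not been caught live, one option would be to sample the journal state on a timer so the exact onset shows up in a log. The following is only a minimal sketch: the API URL and credentials are placeholders, the 500 msg/s and 50% thresholds are arbitrary, and the endpoint (`/system/journal`) and its field names are assumptions that should be checked against the API browser of the installed Graylog version.

```python
import base64
import json
import time
import urllib.request

# Placeholders (assumptions): adjust URL and credentials for the real node.
GRAYLOG_API = "http://127.0.0.1:9000/api"
USER, PASSWORD = "admin", "changeme"


def fetch_journal(api=GRAYLOG_API, user=USER, password=PASSWORD):
    """GET the journal summary and return the decoded JSON body."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        f"{api}/system/journal",
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def summarize(journal):
    """Reduce a journal response to the numbers relevant to this problem.

    Field names are assumed; missing fields fall back to 0.
    """
    size = journal.get("journal_size", 0)
    limit = journal.get("journal_size_limit", 0)
    return {
        "in_per_s": journal.get("append_events_per_second", 0),
        "out_per_s": journal.get("read_events_per_second", 0),
        "uncommitted": journal.get("uncommitted_journal_entries", 0),
        "utilization": round(size / limit, 3) if limit else 0.0,
    }


def is_degraded(summary, min_out=500, max_util=0.5):
    """True when a backlog exists and output is slow, or the journal is filling.

    The thresholds are arbitrary starting points, not recommended values.
    """
    slow = summary["uncommitted"] > 0 and summary["out_per_s"] < min_out
    return slow or summary["utilization"] >= max_util


def run(interval=60):
    """Print one timestamped line per sample until interrupted."""
    while True:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        try:
            s = summarize(fetch_journal())
            print(stamp, "DEGRADED" if is_degraded(s) else "ok", s, flush=True)
        except (OSError, ValueError) as exc:  # network or JSON decoding failure
            print(stamp, "fetch failed:", exc, flush=True)
        time.sleep(interval)
```

Calling `run()` and redirecting the output to a file would give a timeline to compare against the alert times above, for example to see whether output drops before or after the GC pause warning.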
Our best guess is that some service, maybe on Elasticsearch, runs at a certain time and causes the system to underperform. Again, this is just a guess, and even then, we haven’t seen a consistent time at which this happens.
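One way this guess could be tested is to sample the Elasticsearch write thread pool and watch for queue growth or rejected bulk requests around the time output drops. This is a rough sketch only: the Elasticsearch URL is a placeholder, it assumes no authentication, and it assumes the pool is named `write` (older 6.x releases call it `bulk`).

```python
import json
import urllib.request

# Placeholder (assumption): Elasticsearch on the same host, no authentication.
ES_URL = "http://127.0.0.1:9200"


def fetch_write_pool(es_url=ES_URL):
    """Return the _cat/thread_pool rows for the write pool as a list of dicts."""
    url = (
        f"{es_url}/_cat/thread_pool/write"
        "?format=json&h=node_name,active,queue,rejected"
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def totals(rows):
    """Sum the counters across nodes; the _cat API returns numbers as strings."""
    keys = ("active", "queue", "rejected")
    return {k: sum(int(r.get(k, 0)) for r in rows) for k in keys}


def new_rejections(prev, curr):
    """Rejections since the previous sample.

    The counter restarts from zero when Elasticsearch restarts, so a lower
    current value is taken as the count since that restart.
    """
    if curr["rejected"] < prev["rejected"]:
        return curr["rejected"]
    return curr["rejected"] - prev["rejected"]
```

Comparing `totals(fetch_write_pool())` from one minute to the next, a steadily growing `queue` or a non-zero `new_rejections` would point at Elasticsearch being the bottleneck. Flat numbers during a slowdown would suggest looking at Graylog itself, such as the GC pauses reported above.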
Ubuntu 20.04.3 LTS
AWS plugins 4.2.5
Elasticsearch 6 Support 4.2.5+59802bf
Elasticsearch 7 Support 4.2.5+59802bf
Enterprise Integrations 4.2.5
Graylog Enterprise 4.2.5
Graylog Enterprise (ES6 Support)
Graylog Enterprise (ES7 Support)
Threat Intelligence Plugin 4.2.5