I am new to Graylog and figuring out our setup. The setup is as follows:
4 Graylog Servers
1 Graylog Webserver
(All 5 Graylog servers run MongoDB, with one primary)
3 Elasticsearch Data Node Servers
3 Elasticsearch Master Node Servers
We received alerts about a Graylog pool failure.
We were at 90% disk space utilization on the Elasticsearch nodes (see the disk check sketched after this list).
The process buffer was at 100%.
The out messages count was at 0.
One Graylog server in the group was unresponsive.
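
For reference, this is the kind of per-node disk check that applies here; a minimal sketch against the Elasticsearch _cat API, assuming it is reachable on localhost:9200 (host/port are placeholders):

```python
import json
import urllib.request

ES = "http://localhost:9200"  # placeholder; point at any ES node

# _cat/allocation reports per-node disk usage -- the same numbers the
# disk watermarks (85% low / 90% high by default) are evaluated against.
url = ES + "/_cat/allocation?format=json&bytes=gb"
with urllib.request.urlopen(url) as resp:
    rows = json.load(resp)

for row in rows:
    if row.get("disk.percent") is None:  # skip the UNASSIGNED row, if any
        continue
    print(f"{row['node']}: {row['disk.percent']}% used "
          f"({row['disk.used']} of {row['disk.total']} GB)")
```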
What we did:
Using the Graylog web interface, under System -> Indices, we picked the largest index set (around 30 indices), deleted the oldest indices, and also reduced the “Max number of indices” setting to a lower value (making sure it stayed higher than the number of indices currently in use, since we were not yet at the maximum for that index set). A rough API equivalent of the deletion step is sketched after this list.
There was an issue with one Graylog server’s NIC; once that was resolved, the server came back online and started sharing the load.
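
For the deletion step above: the web UI is the safer route since Graylog tracks its own indices, but the action roughly corresponds to the following against the Elasticsearch API (a minimal sketch; the graylog_* prefix and the count of 5 are placeholders for your index set):

```python
import json
import urllib.request

ES = "http://localhost:9200"  # placeholder; point at any ES node
PREFIX = "graylog_*"          # placeholder; your index set's prefix
N_OLDEST = 5                  # placeholder; how many indices to drop

# List the index set's indices with their creation time (epoch millis),
# then sort oldest-first.
url = f"{ES}/_cat/indices/{PREFIX}?format=json&h=index,store.size,creation.date"
with urllib.request.urlopen(url) as resp:
    indices = sorted(json.load(resp), key=lambda i: int(i["creation.date"]))

# DELETE is irreversible -- print and review before running this for real.
for idx in indices[:N_OLDEST]:
    print("deleting", idx["index"], idx["store.size"])
    req = urllib.request.Request(f"{ES}/{idx['index']}", method="DELETE")
    urllib.request.urlopen(req)
```

If indices are removed outside the UI like this, Graylog’s index ranges should be recalculated afterwards (System -> Indices has a maintenance option for that) so searches don’t reference deleted indices.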
Results after the changes:
The alerts stopped.
The unprocessed message count started going down (there is still a large backlog, but it has been slowly shrinking over the last 4 days).
The process buffer is still at 100%.
We see in/out messages flowing through all Graylog servers, and the number of indices is increasing.
There are no new alerts.
ES cluster status is green for all indices.
However, we don’t see any messages in the streams, whatever time period we select, even though the messages/sec counter is constantly incrementing/decrementing. (NTP appears to be syncing the clocks on all servers.) A quick timestamp-skew check is sketched below.
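
On that last point, one way to rule out clock/timestamp skew: if messages are being indexed with timestamps in the future (or far past), relative searches like “last 5 minutes” return nothing even while throughput counters move. A minimal sketch that compares the newest indexed timestamp against the current UTC time (graylog_* is again a placeholder prefix):

```python
import json
import urllib.request
from datetime import datetime, timezone

ES = "http://localhost:9200"  # placeholder; point at any ES node

# Ask for the single newest message across the index set.
body = json.dumps({
    "size": 1,
    "sort": [{"timestamp": {"order": "desc"}}],
    "_source": ["timestamp", "source"],
}).encode()
req = urllib.request.Request(
    ES + "/graylog_*/_search",
    data=body,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    hits = json.load(resp)["hits"]["hits"]

# Graylog stores timestamps in UTC; a large gap between these two lines
# would explain empty stream results for any selected time range.
if hits:
    print("newest indexed timestamp:", hits[0]["_source"]["timestamp"])
print("current UTC time:        ", datetime.now(timezone.utc).isoformat())
```

If the newest timestamp is well ahead of (or behind) the current time, the sending hosts’ clocks or a timezone misconfiguration would be the next thing to check.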
Any help from the community is much appreciated.