I am new to Graylog and still figuring out our setup. The setup is as follows:
4 Graylog Servers
1 Graylog Webserver
(All 5 Graylog servers run MongoDB, with one primary)
3 Elasticsearch data node servers
3 Elasticsearch master node servers
We got alerts on Graylog pool failure.
We were running at 90% disk space utilization on the Elasticsearch nodes.
Process Buffer was at 100%.
Out messages was at 0.
One Graylog server in the group was unresponsive.
What we did:
Using the Graylog web interface, under System -> Indices, we picked around 30 of the oldest indices in the largest index set and deleted them. We also reduced “Max number of indices” to a lower value, keeping it above the number of indices currently in use (the index set had not yet reached its maximum).
There was an issue with the NIC on one of the Graylog servers. Once that was resolved, the server came back online and started sharing the load.
Result after changes:
After that, we saw the alerts stopped.
The unprocessed message count started going down (there is still a large number of unprocessed messages, but the count has been slowly decreasing over the last 4 days).
Process buffer is still at 100%.
We see in/out messages flowing through all graylog servers, indices are increasing in number.
There are no new alerts.
ES cluster status is green for all indices.
We don’t see any messages in the streams, whatever time period we select. Messages/sec keeps going up and down. (NTP appears to be syncing the clocks on all servers.)
I can see that the indices are being rotated and that the current index of the index set is in the active write state, while the older ones are read-only. Isn’t this expected? I would assume that once an index has been rotated, we can’t and shouldn’t try to set it back to read-write?
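For a rough sense of how long a backlog like this takes to clear, the estimate is simply the backlog divided by the net drain rate (out minus in). The sketch below uses made-up numbers, not values from this cluster, and `hours_to_drain` is a hypothetical helper, not something Graylog provides:

```python
def hours_to_drain(backlog, in_per_sec, out_per_sec):
    """Estimate hours until the backlog is empty.

    Returns None when the backlog is not shrinking (out <= in).
    """
    net = out_per_sec - in_per_sec  # messages removed from the backlog per second
    if net <= 0:
        return None
    return backlog / net / 3600


# Hypothetical figures for illustration only.
estimate = hours_to_drain(backlog=36_000_000, in_per_sec=4_000, out_per_sec=5_000)
print(f"Estimated hours to drain: {estimate}")
```

If the result is None, the nodes are not keeping up at all and the backlog will keep growing until throughput improves or input drops.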
2019-08-26T07:46:12-07:00 5efd5f0d / xxxx Optimizing index <firewall_729>.
2019-08-26T07:46:12-07:00 5efd5f0d / xxxx Flushed and set <firewall_729> to read-only.
2019-08-26T07:45:41-07:00 5efd5f0d / xxxxl Cycled index alias <firewall_deflector> from <firewall_729> to <firewall_730>.
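As a rough mental model of what those three log lines describe, here is a toy sketch (not Graylog's actual code; the `IndexSet` class and its method names are invented for illustration). The write alias always points at exactly one index, and each rotation moves it to a new index and leaves the previous one read-only:

```python
class IndexSet:
    """Toy model of an index set with a single write alias (<prefix>_deflector)."""

    def __init__(self, prefix, first_number=0):
        self.prefix = prefix
        self.read_only = []  # rotated indices: searchable, no longer written to
        self.write_index = f"{prefix}_{first_number}"  # current alias target

    def rotate(self):
        """Point the alias at a new index and mark the old one read-only."""
        old = self.write_index
        next_number = int(old.rsplit("_", 1)[1]) + 1
        self.write_index = f"{self.prefix}_{next_number}"
        self.read_only.append(old)
        return old, self.write_index


firewall = IndexSet("firewall", first_number=729)
old, new = firewall.rotate()
print(f"Cycled index alias firewall_deflector from {old} to {new}")
print("Read-only indices:", firewall.read_only)
```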
NTP could be an issue. It could also be that the source is sending its logs with a timestamp that is ahead of your Graylog instance, e.g. the source is UTC and Graylog is UTC-4, or Graylog is UTC and the logs are UTC+4. Another issue: if your process buffer is at 100%, your journal is probably filling up and running behind on current messages. If you check the journal status, you can see the age of the oldest message. Depending on the journal size, you can easily get 2-3 hours behind.
You haven’t mentioned versions of anything, so perhaps you can shed some more light on your setup. Is there a load balancer involved, or are the sources sending to individual GL servers? Also, for someone new to Graylog, you really jumped in with a pretty complex setup. Can you provide more details on the servers themselves? CPU/RAM/HDD? There are tweaks that can be made, but we’d need more information.
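To illustrate the timestamp point, here is a minimal sketch under assumed conditions: a source at UTC+4 sends its local wall-clock time with no zone information, and the receiver reads it as UTC. The stored timestamp then lands 4 hours in the future, so a relative search that ends at "now" never includes it. The helper `in_relative_window` and all the times are hypothetical and only model the idea:

```python
from datetime import datetime, timedelta, timezone


def in_relative_window(message_ts, now, window):
    """True if message_ts lies in [now - window, now], like a 'last N minutes' search."""
    return now - window <= message_ts <= now


now = datetime(2019, 8, 26, 14, 46, 12, tzinfo=timezone.utc)

# Wall-clock time on a source at UTC+4, sent without any zone information.
local_wall_clock = datetime(2019, 8, 26, 18, 46, 12)

# Interpreted correctly (UTC+4) versus misread as UTC.
correct = local_wall_clock.replace(tzinfo=timezone(timedelta(hours=4)))
misread = local_wall_clock.replace(tzinfo=timezone.utc)

print("offset of misread timestamp:", misread - now)
print("correct in last 5 min:", in_relative_window(correct, now, timedelta(minutes=5)))
print("misread in last 1 day:", in_relative_window(misread, now, timedelta(days=1)))
```

If this is what is happening, an absolute search over a range that extends a few hours into the future should turn up the "missing" messages.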