I’m not quite sure if I should post this here or create a GitHub issue, but let’s start here.
We have the following Graylog production setup running on CentOS 7.3 (all components installed from the yum repositories with Ansible):
- 3x Graylog 2.1.3 running on Java 1.8.0_92 with 8GB heap (12vCPU, 12GB RAM per virtual machine)
- 3x MongoDB 3.2.11 Replica Set
- 8x Elasticsearch 2.4.4
On April 3rd we hit an issue where Graylog suddenly stopped triggering all the alerts defined across our streams. We have been running this setup for nearly a year (starting with version 2.0.2) and nothing similar has ever happened before; we have been on 2.1.3 since it was released. We ingest logs at an average rate of 800 messages per second, and the current setup handles that load without any visible problems.
At the moment the alerts stopped triggering, the following lines appeared in the server log on the Graylog master node: https://pastebin.com/NMxQM3vZ
So it seems to have been some kind of connection error with MongoDB. Although the log shows that the connection to MongoDB was successfully re-established a moment later, the alerts did not start triggering again until I manually restarted the Graylog master node.
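For what it's worth, here is a minimal sketch (not from our actual tooling, just an illustration) of the kind of check we ran to convince ourselves the replica set had come back healthy after the blip. The replica set name `rs0` and the host names are made-up placeholders; the helper just inspects a document shaped like the output of MongoDB's `replSetGetStatus` admin command (what `rs.status()` returns in the mongo shell), and the live pymongo call is only sketched in comments:

```python
def has_healthy_primary(rs_status):
    """Return True if exactly one member is PRIMARY and every member reports health == 1.

    `rs_status` is expected to look like the output of the
    `replSetGetStatus` admin command.
    """
    members = rs_status.get("members", [])
    primaries = [m for m in members if m.get("stateStr") == "PRIMARY"]
    unhealthy = [m for m in members if m.get("health", 0) != 1]
    return len(primaries) == 1 and not unhealthy

# Live usage (requires a running replica set; host names are placeholders):
# from pymongo import MongoClient
# client = MongoClient("mongodb://mongo1,mongo2,mongo3/?replicaSet=rs0")
# status = client.admin.command("replSetGetStatus")
# print(has_healthy_primary(status))

# Offline example with a trimmed status document:
sample = {
    "set": "rs0",
    "members": [
        {"name": "mongo1:27017", "stateStr": "PRIMARY", "health": 1},
        {"name": "mongo2:27017", "stateStr": "SECONDARY", "health": 1},
        {"name": "mongo3:27017", "stateStr": "SECONDARY", "health": 1},
    ],
}
print(has_healthy_primary(sample))  # → True
```

In our case a check like this reported the replica set as healthy again shortly after the errors, which is exactly why the stuck alerting on the Graylog side surprised us.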
So, do you have any insight into this? Could this be a bug in Graylog, or just a “glitch in the Matrix”? If there were transient network problems (which can and will happen occasionally), I’m wondering why Graylog did not resume triggering alerts once the problems were resolved and the connections to the MongoDB replica set were re-established.