Graylog suddenly stopped triggering alerts

Hello,

I’m not quite sure if I should post this here or create a GitHub issue, but let’s start here.
We have the following Graylog production setup running on CentOS 7.3 (all components are installed from the yum repositories with Ansible):

  • 3x Graylog 2.1.3 running on Java 1.8.0_92 with an 8 GB heap (12 vCPUs and 12 GB RAM per virtual machine)
  • 3x MongoDB 3.2.11 Replica Set
  • 8x Elasticsearch 2.4.4

On April 3rd we faced an issue where Graylog suddenly stopped triggering all the alerts defined in different streams. We have been running this setup for nearly a year (starting with version 2.0.2), and we have been on version 2.1.3 since it was released. Nothing similar has ever happened before. We ingest logs at an average rate of 800 messages per second, and our current setup handles the load without any visible problems.

At the moment the alerts stopped triggering, the following lines were written to server.log on the Graylog master node: https://pastebin.com/NMxQM3vZ

So it seems to be some kind of connection error with MongoDB. Although the log shows that the connection to MongoDB was successfully re-established a moment later, the alerts did not start triggering again until I manually restarted the Graylog master node.

So, do you have any insight into this? Could this be a bug in Graylog, or just a “glitch in the Matrix”? :) If there were problems in the network (which can and will happen occasionally), I’m wondering why Graylog did not resume triggering alerts once the problems were resolved and the connections to the MongoDB replica set were re-established.

Thanks!

Br,
Henri

Hej Henri,

Do you have any plugins installed that provide additional notifications?

If possible, you might want to force this error by creating a network glitch in your setup and watching whether it happens again. After that, you can send us some information on how to reproduce it.
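
To correlate such a test with what Graylog does, it can help to record exactly when the MongoDB members become unreachable and when they come back. The following is only a minimal sketch using plain TCP connects from the Graylog master node; the host names in `MONGO_MEMBERS` and the function names are hypothetical, and it does not talk to Graylog or speak the MongoDB protocol at all.

```python
import socket
import time
from datetime import datetime

# Hypothetical replica set members; replace with your own hosts and ports.
MONGO_MEMBERS = [
    ("mongo1.example.com", 27017),
    ("mongo2.example.com", 27017),
    ("mongo3.example.com", 27017),
]


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def watch(members, interval=5.0, rounds=None):
    """Print a timestamped line whenever a member's reachability changes.

    Runs forever when rounds is None, otherwise for the given number of
    polling rounds. Returns the last observed state per member.
    """
    state = {}
    done = 0
    while rounds is None or done < rounds:
        for host, port in members:
            up = is_reachable(host, port)
            if state.get((host, port)) != up:
                status = "reachable" if up else "UNREACHABLE"
                print(f"{datetime.now().isoformat()} {host}:{port} {status}")
                state[(host, port)] = up
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return state
```

Calling `watch(MONGO_MEMBERS)` on the master node runs until interrupted. The timestamps it prints can then be compared with the MongoDB errors in server.log and with the time the last alert fired.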

regards
Jan

Hey,

Do you mean alarm callback plugins? We use graylog-plugin-slack-2.4.0.jar and graylog-plugin-hipchat-1.3.0.jar to send alerts. In addition, we use the default email callback.

Br,
Henri

Hej Henri,

Did you check whether they are compatible with the Graylog version you are using?

Do you have the same issue if you remove those plugins?

regards
Jan

Hey,

At least they have been working fine ever since we upgraded to 2.1.3 when it was released. For now, I cannot reproduce this, as it was a one-time issue (and hopefully will not recur). I was just wondering whether the errors in the server.log from the Graylog master node would give you some insight into this issue. I’ll get back to you if this happens again. :) Thanks anyway!

Br,
Henri