Notifications and events stopped firing completely

Hello, my firm has been running Graylog 3.2.4 (non-enterprise), ES 6.8.8, and Mongo 3.6.3 since April of this year, with both Filebeat and syslog inputs. We recently hit an issue where alert/e-mail notifications abruptly came to a full stop. Even events are not triggering properly, although I can see messages being routed into streams and matching the condition configured in the event. This is what I've established so far:

  1. Attempted restarting all services (Graylog, ES, MongoD) with no effect.

  2. This is not an issue between the mail server and Graylog, as I can send test notifications fine (they arrive immediately).

  3. After a second restart I noticed this error repeating constantly in the log, right after the "Execute job" message:

    2020-10-13T11:46:04.436-05:00 DEBUG [JobExecutionEngine] Execute job: pricefeeder-errors/5e908564de62587f7c3a2d2c/notification-execution-v1 (job-class=EventNotificationExecutionJob trigger=5f851b4bce4f277023fc1573 config=Config{type=notification-execution-v1, notificationId=5e908564de62587f7c3a2d2b})
    2020-10-13T11:46:04.439-05:00 ERROR [JobExecutionEngine] Unhandled job execution error - trigger=5f851b4bce4f277023fc1573 job=5e908564de62587f7c3a2d2c
    org.graylog2.indexer.messages.DocumentNotFoundException: Couldn't find message <8a43bf35-0988-11eb-8fd1-246e9662a6c0> in index <graylog_83>
    at org.graylog2.indexer.messages.Messages.get(Messages.java:119) ~[graylog.jar:?]
    at org.graylog.events.processor.aggregation.AggregationEventProcessor.sourceMessagesForEvent(AggregationEventProcessor.java:148) ~[graylog.jar:?]
    at org.graylog.events.notifications.EventBacklogService.getMessagesForEvent(EventBacklogService.java:62) ~[graylog.jar:?]
    at org.graylog.events.notifications.EventNotificationService.getBacklogForEvent(EventNotificationService.java:49) ~[graylog.jar:?]
    at org.graylog.events.legacy.LegacyAlarmCallbackEventNotification.execute(LegacyAlarmCallbackEventNotification.java:54) ~[graylog.jar:?]
    at org.graylog.events.notifications.EventNotificationExecutionJob.execute(EventNotificationExecutionJob.java:135) ~[graylog.jar:?]
    at org.graylog.scheduler.JobExecutionEngine.executeJob(JobExecutionEngine.java:166) ~[graylog.jar:?]
    at org.graylog.scheduler.JobExecutionEngine.lambda$handleTrigger$2(JobExecutionEngine.java:144) ~[graylog.jar:?]
    at com.codahale.metrics.Timer.time(Timer.java:137) ~[graylog.jar:?]
    at org.graylog.scheduler.JobExecutionEngine.handleTrigger(JobExecutionEngine.java:144) ~[graylog.jar:?]
    at org.graylog.scheduler.JobExecutionEngine.lambda$execute$0(JobExecutionEngine.java:119) ~[graylog.jar:?]
    at org.graylog.scheduler.worker.JobWorkerPool.lambda$execute$0(JobWorkerPool.java:110) ~[graylog.jar:?]
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181) [graylog.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
    2020-10-13T11:46:04.456-05:00 DEBUG [JournallingMessageHandler] End of batch, journalling 1 messages

  4. I saw the above messages hundreds of times. My theory was that Graylog was stuck checking one particular event condition, so I removed the event via the web GUI and restarted Graylog. After that, this message appeared repeatedly:

    2020-10-13T11:57:15.219-05:00 ERROR [EventNotificationExecutionJob] Couldn't find event definition with ID <5e908599de62587f7c3a2d67>.
    2020-10-13T11:57:15.219-05:00 ERROR [JobExecutionEngine] Unhandled job execution error - trigger=5f851b4bce4f277023fc1d65 job=5e908564de62587f7c3a2d2c
    java.util.NoSuchElementException: No value present
    at java.util.Optional.get(Optional.java:135) ~[?:1.8.0_191]
    at org.graylog.events.notifications.EventNotificationExecutionJob.execute(EventNotificationExecutionJob.java:122) ~[graylog.jar:?]
    at org.graylog.scheduler.JobExecutionEngine.executeJob(JobExecutionEngine.java:166) ~[graylog.jar:?]
    at org.graylog.scheduler.JobExecutionEngine.lambda$handleTrigger$2(JobExecutionEngine.java:144) ~[graylog.jar:?]
    at com.codahale.metrics.Timer.time(Timer.java:137) ~[graylog.jar:?]
    at org.graylog.scheduler.JobExecutionEngine.handleTrigger(JobExecutionEngine.java:144) ~[graylog.jar:?]
    at org.graylog.scheduler.JobExecutionEngine.lambda$execute$0(JobExecutionEngine.java:119) ~[graylog.jar:?]
    at org.graylog.scheduler.worker.JobWorkerPool.lambda$execute$0(JobWorkerPool.java:110) ~[graylog.jar:?]
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181) [graylog.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
    at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
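
To check whether the repeated errors in steps 3 and 4 all come from one stuck job, the log could be tallied per job ID. This is a minimal sketch: the regular expression is based only on the ERROR lines quoted above, and `count_job_errors` is a hypothetical helper name, not part of Graylog.

```python
import re
from collections import Counter

# Pattern derived from the ERROR lines quoted in this post (an assumption:
# adjust it if your log layout differs).
ERROR_RE = re.compile(
    r"ERROR \[JobExecutionEngine\] Unhandled job execution error - "
    r"trigger=(?P<trigger>[0-9a-f]+) job=(?P<job>[0-9a-f]+)"
)


def count_job_errors(lines):
    """Return a Counter mapping job ID -> number of unhandled execution errors."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            counts[match.group("job")] += 1
    return counts


# Sample lines copied from the log excerpts in this post.
sample = [
    "2020-10-13T11:46:04.439-05:00 ERROR [JobExecutionEngine] Unhandled job execution error - trigger=5f851b4bce4f277023fc1573 job=5e908564de62587f7c3a2d2c",
    "2020-10-13T11:57:15.219-05:00 ERROR [JobExecutionEngine] Unhandled job execution error - trigger=5f851b4bce4f277023fc1d65 job=5e908564de62587f7c3a2d2c",
    "2020-10-13T11:46:04.456-05:00 DEBUG [JournallingMessageHandler] End of batch, journalling 1 messages",
]

for job_id, errors in count_job_errors(sample).items():
    print(job_id, errors)
```

An open file handle for the server log can be passed in place of `sample`, since the function only iterates over lines. Both quoted errors share the same job ID even though the trigger IDs differ.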

Following this post: Couldn't find job definition

…I backed up the Mongo DB, removed the schedule triggers, and restarted all services. After I re-created the events in Graylog, alerts were restored, but only for a short time, about 2 hours. Earlier, we had been “spammed” by a very noisy application (one whose messages match the aforementioned event/notification “pricefeeder-errors”) that produced millions of messages within a short period (5M in about 2 hours).

Currently, I am seeing the same job execute over and over, with no alerts/notifications:

    2020-10-14T12:16:23.733-05:00 DEBUG [JobExecutionEngine] Execute job: pricefeeder-errors/5f85f3e64598686456832fff/notification-execution-v1 (job-class=EventNotificationExecutionJob trigger=5f8601e34598686456872814 config=Config{type=notification-execution-v1, notificationId=5f85f3e64598686456832ffe})
    2020-10-14T12:18:48.374-05:00 DEBUG [EventNotificationExecutionJob] Notification <5f85f3e64598686456832ffe> triggered but it's in grace period.

The grace period for that event definition is set to 5 minutes. Even if the grace period message above is correct, I should still be seeing alerts for other events: we have 15+ different event definitions that use different streams.
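
To make that expectation concrete, here is a toy model of a grace period tracked per event definition. It is only an illustration of the behavior I expect, not Graylog's actual implementation. The class name, method name, and the `other-definition` label are all hypothetical.

```python
from datetime import datetime, timedelta


class GracePeriodTracker:
    """Toy model: suppress repeat notifications per event definition, not globally."""

    def __init__(self, grace):
        self.grace = grace
        self.last_sent = {}  # event definition name -> time of last notification

    def should_notify(self, definition, now):
        last = self.last_sent.get(definition)
        if last is not None and now - last < self.grace:
            return False  # still inside this definition's grace period
        self.last_sent[definition] = now
        return True


tracker = GracePeriodTracker(grace=timedelta(minutes=5))
t0 = datetime(2020, 10, 14, 12, 16, 23)

first = tracker.should_notify("pricefeeder-errors", t0)
repeat = tracker.should_notify("pricefeeder-errors", t0 + timedelta(minutes=2))
other = tracker.should_notify("other-definition", t0 + timedelta(minutes=2))
later = tracker.should_notify("pricefeeder-errors", t0 + timedelta(minutes=6))

print(first, repeat, other, later)  # prints: True False True True
```

Under this model, a grace period on "pricefeeder-errors" suppresses only its own repeat notifications, so the other definitions would still fire.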

What can I do to restore alerts / notifications?

Thank you,

James