Delayed event notifications sending in batches

1. Describe your incident:
I have two events set up so far in my proof of concept:

  1. Checking for a particular error message to alert me when a 3rd party API service is failing so I can contact them before we lose too many requests (runs every 5 minutes)

  2. Checking for the absence of log messages, which may hint at some other issue: I expect log messages during regular processing, so inactivity would be suspicious, typically between 8am and 6pm, though I have the event running all the time (runs every hour)

Event 1 doesn’t have much frequency in sending notifications as the condition isn’t met very often.

Event 2 meets its condition after hours, when users aren’t working and there is no activity.

I am worried I won’t get the notifications when I need them, because over the past couple of days they seem to be delayed in triggering on the Graylog side. The emails arrived as follows:
9/28/23: grouping of emails at 1am and then again at 7:14am
9/30/23: grouping of emails at 1am and then again at 10:22am
10/1/23: grouping of emails at 1am, and then again at 11:56pm

The groupings of emails seem to cover the period of inactivity I expect (13-15 hours of inactivity); they are just all sent at once.

2. Describe your environment:

  • OS Information:
    Debian 11

  • Package Version:
    Docker v 24.0.5, build ced0996
    Image: graylog/graylog:5.1 (reported: 5.1.4+6fa2de3 on 86216c344ca4)
    Mongo: Image: mongo:6.0.5-jammy
    OpenSearch: Image: opensearchproject/opensearch:2

3. What steps have you already taken to try and solve the problem?
I’ve noticed that if I set up a duplicate of Event 2 pointing at the test environment for the process (which is inactive most of the day), the emails arrive closer to on time instead of being delayed. (I just don’t want notifications from that environment, since inactivity there isn’t suspicious like it is in production.)

4. How can the community help?
What kind of configuration or points should I verify as to why events are being delayed? Is there some nuance I might be missing?

If you look at the events that have fired in the Graylog UI, do they trigger at the correct times? I.e., is it just the emails themselves that seem to be getting bunched up? You don’t have any of the settings related to silencing notifications if an event has already fired recently, correct?

At 7:30am this morning, I looked in Alerts & Events for the last day and saw the latest event from 23:30 yesterday (10/1/2023), when I expected at least 7 more from today (10/2/2023). It seems event triggering is delayed for some reason.
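As a rough sanity check on that expectation, the sketch below lists the hourly execution times that would fall between the last event seen (23:30) and the 7:30am check. It assumes, hypothetically, that executions stay aligned to the half hour of the last event; `expected_runs` is an illustrative helper, not part of Graylog.

```python
from datetime import datetime, timedelta


def expected_runs(last_seen: datetime, now: datetime, every: timedelta) -> list[datetime]:
    """Return the execution times after last_seen, up to and including now."""
    runs = []
    t = last_seen + every
    while t <= now:
        runs.append(t)
        t += every
    return runs


# Values taken from the post: last event at 23:30 on 10/1, checked at 7:30 on 10/2.
last_seen = datetime(2023, 10, 1, 23, 30)
now = datetime(2023, 10, 2, 7, 30)

# Assumption: the event keeps running every hour on the half hour.
missing = expected_runs(last_seen, now, timedelta(hours=1))
for run in missing:
    print(run.strftime("%m/%d %H:%M"))
print(len(missing), "executions expected")
```

Under that assumption there are eight candidate times, 00:30 through 07:30, the last of which coincides with the moment of the check, so "at least 7" missing events is consistent.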

My Event Settings
Type: Aggregation
Search Within: 1 hours
Executes Every: 1 hours
Enable Scheduling: yes
Grace Period is disabled

Global Events Configuration
Search Timeout: 1 minutes
Notification retry period: 5 minutes
Default notifications backlog size: 50
Catchup Window Size: 1 hours

So I would start by manually running the same search your event runs for the different time periods, and see if they complete, what results you get, and so on. Then also have a look at the server.log to see if anything sticks out.
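One way to script that manual check is to count messages per hourly window directly in OpenSearch. The following is a minimal sketch under several assumptions, none of which come from this thread: the default `graylog_*` index prefix, a `streams` field holding stream IDs, timestamps stored in UTC as `yyyy-MM-dd HH:mm:ss.SSS`, and an OpenSearch node reachable over plain HTTP without authentication. The helper names and the stream ID are placeholders.

```python
import json
import urllib.request
from datetime import datetime, timedelta


def fmt(dt: datetime) -> str:
    """Format a UTC datetime as 'yyyy-MM-dd HH:mm:ss.SSS' (assumed Graylog format)."""
    return dt.strftime("%Y-%m-%d %H:%M:%S.") + f"{dt.microsecond // 1000:03d}"


def hourly_windows(start: datetime, end: datetime, step: timedelta = timedelta(hours=1)):
    """Split [start, end) into consecutive windows of length step."""
    windows = []
    t = start
    while t + step <= end:
        windows.append((t, t + step))
        t += step
    return windows


def count_query(stream_id: str, frm: datetime, to: datetime) -> dict:
    """Build a _count request body for one stream and one time window."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"streams": stream_id}},
                    {"range": {"timestamp": {"gte": fmt(frm), "lt": fmt(to)}}},
                ]
            }
        }
    }


def count_messages(base_url: str, index_pattern: str, body: dict) -> int:
    """POST the body to the OpenSearch _count API and return the count."""
    req = urllib.request.Request(
        f"{base_url}/{index_pattern}/_count",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["count"]


# Placeholder stream ID; times are UTC, so convert from local time first.
windows = hourly_windows(datetime(2023, 10, 2, 0, 0), datetime(2023, 10, 2, 3, 0))
for frm, to in windows:
    body = count_query("YOUR_STREAM_ID", frm, to)
    print(fmt(frm), "->", fmt(to))
    print(json.dumps(body))
```

The loop only prints the request bodies. To actually run the counts, call something like `count_messages("http://localhost:9200", "graylog_*", body)` with the URL and index pattern adjusted to your setup; a window that returns 0 is one where the "no messages" event should have fired.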

After attaching to the graylog container and testing those different time periods, I see nothing in the stdout when running those periods manually. Apart from the 1am reboot of the service I am monitoring logs for, all the time periods have the expected empty results that should have triggered the events.

Is the issue because I am trying to trigger off of having no results for a time period?

(When we move out of POC I am thinking of running Graylog directly instead of in Docker, since that might make troubleshooting a bit easier, but I’m not totally sure yet.)

Yes, I have seen alerts looking for the absence of something behaving weirdly, can you post a screenshot or full details of the event definition?

I will try to post the pages that seem to have the more important details so I don’t flood this thread. I had it all in a PDF but cannot upload it, I guess because my account is too new.
image

Second half of that page
image

Notifications Tab
image

So do you have nothing in the “Search Query” field, and the only filter you have is the stream?

Correct.
Lack of ANY messages at all may indicate a problem.
Should I have something there, even if it’s just a *?

I assume that when it’s left blank it defaults to *, but I would put it in there to be safe. I’m doing some digging, because I’m not actually sure that no results always returns as a zero and reliably works with <1.
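To make that concern concrete, here is a purely illustrative sketch of how a "count < 1" check could behave differently depending on whether an empty search comes back as a row with count 0 or as no rows at all. This is not Graylog's actual implementation; both functions and the row shape are hypothetical.

```python
def condition_met(rows, threshold=1):
    """Evaluate 'count < threshold' once per result row.

    With no rows at all there is nothing to compare, so nothing fires.
    """
    return any(row["count"] < threshold for row in rows)


def condition_met_zero_default(rows, threshold=1):
    """Same check, but treat an empty result as a single row with count 0."""
    if not rows:
        rows = [{"count": 0}]
    return any(row["count"] < threshold for row in rows)


empty_result = []                 # search returned no rows at all
zero_row_result = [{"count": 0}]  # search returned an explicit zero

print(condition_met(zero_row_result))            # True
print(condition_met(empty_result))               # False
print(condition_met_zero_default(empty_result))  # True
```

If the real evaluation resembled the first function, an absence-of-messages event would depend on the search returning an explicit zero rather than an empty result, which is the behavior worth verifying.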

Right now the only log ingestion is from this service. Does receiving messages kick off processing for events?

I was assuming there’s some subprocess that runs with the server that kicks off the events.
