Wrong calculation of "Next execution" in Graylog Alerts

1. My problem

Hi everyone, I'm having trouble with the “Next execution” of Graylog Alerts. The “Next execution” is not calculated as configured.

2. My system

  • OS Information: CentOS 7
  • Package Version: 4.3.5+32fa802
  • MongoDB v4.2.18
  • OpenSearch v1.3.5

3. What steps have you already taken to try and solve the problem?

  • Created new alerts (did not work)
  • Disabled and re-enabled alerts (did not work)
  • Updated an existing alert (did not work)

Suspected error log

4. How can the community help?

Show me how to debug or fix the issue.

Thank you,

Do you have any processing bottlenecks? Check System/Nodes > Details and see whether any of the buffers are full. If processing is backed up, alerts can back up as well, which might explain the odd Next timerange result.

All buffers are almost empty.

By the way, for some alerts the difference between Last and Next is only a few seconds.

@quocbao
Hey, I was looking this over and noticed that the Next timerange is a day behind. By chance, did you check the timezone on this server? And do you have NTP installed on it?
Meaning, do these line up?

System/Overview -->Time configuration

EDIT: Did this issue just start? If so, what was done prior to it (updates, upgrades, etc.)?

Hi @gsmith,

Thanks for your reply. Here is my time configuration

Time drift on the MongoDB server (single node)

Time drift on Graylog servers.

The Graylog cluster has run well for months. This issue seems to have started after an incident with our MongoDB several days ago, when I had to use “kill -9”. Nothing in the MongoDB error logs relates to a memory issue or other errors.

Thanks for your attention.

Oh, I see. So whatever happened with MongoDB is now causing you issues.

Have you tried dumping the Graylog database and rebuilding?
To be clear: execute mongodump, then reinstall MongoDB, then load the Graylog database back in.

Hi @gsmith,

I dumped and restored all MongoDB collections to a new MongoDB 4.4 instance but the error kept happening.

At the same time, I found this

I guess db.getCollection('scheduler_triggers').find({"status": "runnable"}).count() should not be greater than db.getCollection('event_definitions').find({}).count().

Is there any mapping between event_definitions and scheduler_triggers that I can use to clean up this mess?

I used this query to find suspicious documents:

db.getCollection("scheduler_triggers").aggregate(
    [
        {
            "$group" : {
                "_id" : {
                    "job_definition_id" : "$job_definition_id"
                },
                "count" : {
                    "$sum" : NumberInt(1)
                }
            }
        }, 
        {
            "$project" : {
                "job_definition_id" : "$_id.job_definition_id",
                "count" : "$count",
                "_id" : NumberInt(0)
            }
        }, 
        {
            "$sort" : {
                "count" : NumberInt(-1)
            }
        }
    ], 
    {
        "allowDiskUse" : true
    }
);
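
A shorter variant of the query above, as a rough sketch only: it adds a $match stage so that only job definitions with more than a chosen number of triggers are listed. The helper name suspiciousTriggerPipeline and its threshold parameter are made up for illustration; the collection and field names are the ones used in the query above.

```javascript
// Hypothetical helper (not part of Graylog): builds an aggregation pipeline
// that lists job_definition_id values with more than `threshold` triggers.
function suspiciousTriggerPipeline(threshold) {
    return [
        // Count triggers per job definition.
        { "$group": { "_id": "$job_definition_id", "count": { "$sum": 1 } } },
        // Keep only job definitions above the threshold.
        { "$match": { "count": { "$gt": threshold } } },
        // Largest offenders first.
        { "$sort": { "count": -1 } }
    ];
}

// Usage in the mongo shell (assumes the "graylog" database is selected):
// db.getCollection("scheduler_triggers").aggregate(
//     suspiciousTriggerPipeline(1), { "allowDiskUse": true });
```

What threshold counts as suspicious is a guess; 1 assumes that each job definition should have only a single trigger document.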

I ran a delete:

> use graylog
switched to db graylog
> db.getCollection("scheduler_triggers").deleteMany({"$and": [{"job_definition_id":"6205e13fd0632503c3f052cc"}, {"triggered_at": null}]})
{ "acknowledged" : true, "deletedCount" : 488872 }
> 
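
For anyone repeating this, a dry run before the delete may be safer. The sketch below builds the same filter as the delete above and counts the matching documents first. The helper name untriggeredFilter is hypothetical. Note that "triggered_at": null also matches documents where the field is missing entirely.

```javascript
// Hypothetical helper: filter for triggers of one job definition whose
// triggered_at is null or absent. Listing both fields in one document is an
// implicit AND, equivalent to the explicit "$and" used in the delete above.
function untriggeredFilter(jobDefinitionId) {
    return { "job_definition_id": jobDefinitionId, "triggered_at": null };
}

// Usage in the mongo shell (assumes the "graylog" database is selected):
// var filter = untriggeredFilter("6205e13fd0632503c3f052cc");
// db.getCollection("scheduler_triggers").countDocuments(filter);  // dry run
// db.getCollection("scheduler_triggers").deleteMany(filter);      // actual delete
```

If the count looks wrong, nothing has been removed yet and the filter can be adjusted.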

Now the Next execution of new events seems to be correct.

I still have problems with the Next timerange of old events.

I tried disabling the event and then re-enabling it, and … it works!!!

Hey,
Oh wow,
This is kind of strange. I wonder what made Mongo do this.