1. Describe your incident:
The event scheduler stopped working after a Graylog update.
Events now show
Status:
runnable
Next execution:
2022-12-12 15:38:11.064 (A few mins in the past)
I'm guessing that with a date in the past it is never going to trigger.
Notifications work, and everything else that I can see works. I switched logging to debug and disabled and re-enabled the events, but to no avail.
Service logs, configurations, and environment variables:
Standalone Ubuntu server running:
Version:
4.2.13+9c90b93, codename Noir
JVM:
PID 1101, Ubuntu 11.0.17 on Linux 5.4.0-1092-aws
Time:
2022-12-13 08:18:56 +00:00
3. What steps have you already taken to try and solve the problem?
We did take a snapshot, but as this wasn't noticed at the time we don't want to revert; we want to fix forward.
Yes, the search does work in quotes, as we are looking for that specific string to alert on.
So the way I think it works:
When a new event is created or modified, the details are written to MongoDB and a schedule is automatically created to trigger the check against the DB; when a match happens, this creates the alert. Our alerts seem to be written to the DB, but the internal schedule is not triggered.
Hence the last execution message and no next execution message.
Another older, unmodified alert:
The string in quotes is working, as we see a result returned on screen in the filter preview. I think if that wasn't working then it may be a case of no matches and no alerts.
Let me be clear: no alerts are working, neither older events nor ones newly created since the update.
2022-12-15T09:57:27.037Z INFO [DiagnosticEventLogger] Current thread pool executor state: ExecutorStateEvent(executorName=SchedulerThreadPoolExecutor, currentQueueSize=0, activeThreads=0, coreThreads=0, leasesOwned=1, largestPoolSize=2, maximumPoolSize=2147483647)
2022-12-15T09:57:33.050Z INFO [Scheduler] Current stream shard assignments: shardId-000000000000
2022-12-15T09:57:33.050Z INFO [Scheduler] Sleeping …
2022-12-15T09:57:40.422Z INFO [DiagnosticEventLogger] Current thread pool executor state: ExecutorStateEvent(executorName=SchedulerThreadPoolExecutor, currentQueueSize=0, activeThreads=0, coreThreads=0, leasesOwned=1, largestPoolSize=2, maximumPoolSize=2147483647)
2022-12-15T09:57:42.042Z INFO [Scheduler] Current stream shard assignments: shardId-000000000000
2022-12-15T09:57:42.042Z INFO [Scheduler] Sleeping …
2022-12-15T09:57:44.049Z INFO [Scheduler] Current stream shard assignments: shardId-000000000000
2022-12-15T09:57:44.049Z INFO [Scheduler] Sleeping …
If no alerts are working, are you sure the Notification that is attached to the Alert Event is working? What kind of Notification are you using?
On @gsmith’s point: with three instances of mongodb_uri defined, likely only the last one would take effect, since the previous value is usually overwritten when you define something more than once.
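To illustrate the "last definition wins" behaviour, here is a minimal sketch of a properties-style parser where later keys overwrite earlier ones. This is not Graylog's actual config loader, only an assumption about how such files are commonly read; the host names in the sample are hypothetical. It also reports which keys are defined more than once, which is a quick way to spot this kind of duplication.

```python
from collections import Counter


def iter_settings(text):
    """Yield (key, value) pairs from 'key = value' lines, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so values containing '=' stay intact.
        key, sep, value = line.partition("=")
        if sep:
            yield key.strip(), value.strip()


def parse_properties(text):
    """Build a dict of settings; a repeated key keeps only its last value."""
    settings = {}
    for key, value in iter_settings(text):
        settings[key] = value
    return settings


def find_duplicates(text):
    """Return {key: count} for every key defined more than once."""
    counts = Counter(key for key, _ in iter_settings(text))
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical config with mongodb_uri defined three times.
SAMPLE_CONF = """
# sample only, not a real server.conf
mongodb_uri = mongodb://host-a:27017/graylog
mongodb_uri = mongodb://host-b:27017/graylog
mongodb_uri = mongodb://host-c:27017/graylog?replicaSet=rs0
"""

if __name__ == "__main__":
    print(parse_properties(SAMPLE_CONF)["mongodb_uri"])
    print(find_duplicates(SAMPLE_CONF))
```

Running something like `find_duplicates` over the real config file would show whether more than one mongodb_uri line is active; commenting out all but the intended one removes the ambiguity.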
For the quoted search where you are looking for a snippet in the full message: yes, that works, it’s just not efficient. In the example you have given, you are asking Graylog to search through all full messages that have come in over the past 28 hours for “Response Code: 96”. Depending on the number of messages over that time, this could be a very expensive search. Graylog is designed so that when the message comes in, you can use extractors and/or the pipeline to break the full message into its constituent parts, which allows for a far more efficient search: <find all response_code fields that have a value of 96 in the past 28 hours>. My initial thought is that the search failed or took too long, since you were searching every minute through so much.
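As a rough sketch of the idea (plain Python, not Graylog extractor or pipeline syntax), the snippet below pulls the response code out once when a message arrives and stores it as its own field, so a later search compares a small field value instead of scanning every full message. The field name response_code comes from the example above; the regex, function names and sample messages are assumptions for illustration.

```python
import re

# Assumed message layout: "... Response Code: <digits> ..."
RESPONSE_CODE = re.compile(r"Response Code:\s*(\d+)")


def extract_response_code(message):
    """Return the response code as an int, or None if the message has none."""
    match = RESPONSE_CODE.search(message)
    return int(match.group(1)) if match else None


def ingest(messages):
    """Do the extraction once at ingest time and keep it as a separate field."""
    return [
        {"message": m, "response_code": extract_response_code(m)}
        for m in messages
    ]


def full_text_search(docs, phrase):
    """The expensive way: scan every full message for the phrase."""
    return [d for d in docs if phrase in d["message"]]


def field_search(docs, code):
    """The cheaper way: compare the already extracted field."""
    return [d for d in docs if d["response_code"] == code]


# Hypothetical sample messages.
SAMPLE_MESSAGES = [
    "payment failed, Response Code: 96, retrying",
    "payment ok, Response Code: 00",
    "heartbeat",
]

if __name__ == "__main__":
    docs = ingest(SAMPLE_MESSAGES)
    print(field_search(docs, 96))
```

In Graylog itself the same split would be done with an extractor or a pipeline rule, and the event definition would then filter on the extracted field rather than on the quoted phrase.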
Hi, thanks for that.
So we continued to troubleshoot and restored a snapshot to another instance. There must have been a crash before the update, as the pre-update snapshot was also broken. We did the same with a snapshot from 7 days earlier and everything is working: notifications, events, the lot.
The 28-hour time frame was purely to trigger the event, as that's when it had last occurred in the logs.
I think the OOM killer killed the Graylog process and something was damaged. We will try restoring the older snapshot to a bigger instance with more memory.
OK, so we figured out the issue to some extent, after Graylog on our original snapshot ran for about 12 hours and then hit the same issue.
We had an ongoing issue that generated 20k logs, and an alert that was triggered tried to send us 20k notifications as well. The event scheduler broke well before that.
We have disabled the events that match the issue until our devs can address it. After disabling them and rebooting the server, events began to work again.