Alarm state not granular enough?


(william george) #1

Given my reading of the Graylog alerting documentation, it seems like Graylog can only reason about the “state” of an alert condition as a whole. Is that right?

i.e. for this field content condition
{
  "field": "Severity",
  "value": "Critical"
}

it would trigger one time and remain triggered as long as there is any unresolved alert about a critical message from any host. Is this correct?

Is there a way to write the alert such that the source or other fields are considered, without having to come up with every possible combination of sources and conditions?

i.e. given these events:
{
  "source": "box1",
  "severity": "critical",
  "reason": "BGP neighbor x.x.x.x down"
}
{
  "source": "box2",
  "severity": "critical",
  "reason": "BGP neighbor x.x.x.x down"
}
{
  "source": "box1",
  "severity": "critical",
  "reason": "BGP neighbor y.y.y.y down"
}
{
  "source": "box2",
  "severity": "critical",
  "reason": "BGP neighbor x.x.x.x down"
}
{
  "source": "box2",
  "severity": "critical",
  "reason": "High input temp: 140 F"
}

We need 4 alerts. The fact that we received and started processing an alert for Box1’s BGP neighbor x.x.x.x doesn’t mean we can ignore Box1’s BGP neighbor y.y.y.y or Box2’s temperature alarms. But I also can’t go around creating n*m alert conditions for every combination of alarm condition and alarm source.
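
To make the count concrete, here is a minimal sketch in Python of the grouping I have in mind. It is purely illustrative and not something Graylog runs; the function name `distinct_alerts` and the choice of (source, reason) as the alert key are my own assumptions.

```python
# Illustrative only: this is not Graylog code, just the counting argument.
events = [
    {"source": "box1", "severity": "critical", "reason": "BGP neighbor x.x.x.x down"},
    {"source": "box2", "severity": "critical", "reason": "BGP neighbor x.x.x.x down"},
    {"source": "box1", "severity": "critical", "reason": "BGP neighbor y.y.y.y down"},
    {"source": "box2", "severity": "critical", "reason": "BGP neighbor x.x.x.x down"},
    {"source": "box2", "severity": "critical", "reason": "High input temp: 140 F"},
]


def distinct_alerts(events):
    """Return one (source, reason) key per distinct critical problem, in first-seen order."""
    seen = []
    for event in events:
        if event["severity"].lower() != "critical":
            continue
        key = (event["source"], event["reason"])
        if key not in seen:
            seen.append(key)
    return seen


alerts = distinct_alerts(events)
print(len(alerts))  # 4
```

The fourth event repeats the second one (box2, neighbor x.x.x.x), which is why five events collapse to four alerts.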

What are my options here?


(Roger Mier) #2

Honestly, the way Graylog does alerts is going to cause you (and me) problems. IMHO, the Alert functionality as it is now isn’t well thought out or implemented. It has no real way to track the state of an alert (and log messages aren’t really stateful, so I’m unsure whether this is even a useful concept for GL). There is also no easy way to either manually close an Alert or mute a persistent/flapping problem. If 2 alert conditions happen less than a second apart, even if they come from different sources, GL will only trigger on the first, and expects you to pick up the others in the Alert backlog messages, or by going to the GL server and searching over the time period where it happened.
E.g.: We have multiple QA servers for testing. Each one is named qa#.example.com, and gets sorted into a Stream for just the QA servers. When I was just sorting like this I missed a lot. If QA1 alerted at the same time as QA2, only one or the other would trigger. Even with no grace period, and set to always alert, errors would get missed.

I’ve dealt with this as well as I can using Pipelines attached to a few ‘main streams’ which handle the coarse sort, e.g. streams and rules set for Prod, QA, Dev, Corp, etc. Then the pipeline takes over and sorts the messages from each numbered QA instance (qa1, qa2, qa3, etc.) into its own Stream (QA1 Stream, QA2 Stream, etc.). These Streams have no rules attached to them; they only receive messages from the pipeline. Each of them has one, maybe 2, Alert Conditions attached. In my case, I alert on any level 3 error thrown by Linux. In your case, as you already deduced, make a Stream for each source, and an Alert on each Stream.

This method has also allowed me to mute specific errors that are generating too much noise in Slack or email, by making Stage 0 a bunch of anti-patterns that won’t sort messages matching those patterns.
E.g.:
rule "no 403"
when
  NOT contains(to_string($message.message), "is forbidden")
then
end
So, if graylog sees the string ‘is forbidden’, Stage 0 will fail, and the message will stay in the main ‘coarse sort’ stream, and will never get alerted on.
I currently have about a dozen or so of these anti-patterns that I have to manage and later delete when the Dev team fixes the problem that I muted. But it’s better than Slack noise. When Slack’s signal-to-noise ratio is off, people stop paying attention, and important things get missed.
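
If it helps to see the overall flow, here is a rough Python model of the two-step sort described above. This is only a sketch of the logic, not pipeline code; the `route` function, the `MUTE_PATTERNS` list, and the stream names are made up for illustration.

```python
# Toy model of the two-step sort; stream names and patterns are hypothetical.
MUTE_PATTERNS = ["is forbidden"]  # the Stage 0 "anti-patterns"


def route(message, mute_patterns=MUTE_PATTERNS):
    """Return the name of the stream a message ends up in."""
    text = message["message"]
    # Stage 0: a muted message never reaches the per-source stage.
    if any(pattern in text for pattern in mute_patterns):
        return "QA"  # stays in the coarse stream, which has no alert condition
    # Stage 1: one stream per source, e.g. "qa1.example.com" -> "QA1 Stream".
    host = message["source"].split(".")[0]
    return f"{host.upper()} Stream"


print(route({"source": "qa1.example.com", "message": "kernel: I/O error"}))
print(route({"source": "qa2.example.com", "message": "access is forbidden"}))
```

The first message lands in its own per-source stream and can be alerted on there; the second one matches a mute pattern and stays in the coarse stream.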


(Jan Doberstein) #3

Hey Roger,
Hey William,

Maybe to clarify: alerts in Graylog are based on a search that runs (by default) every 60 seconds after the last alert search completed successfully.
Alerts are not part of message processing, and messages are not checked before they are ingested into Elasticsearch (which is the last step in the processing). That might explain why the alerting behaves the way it does: it regularly searches your logs for the given pattern. As long as the pattern matches, it alerts.
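
As a very simplified picture of that behaviour, consider the following Python toy model. It is not how Graylog is implemented; the function `check_condition` and the message fields are assumptions chosen only to show why two matching messages in one search window yield a single alert.

```python
# Toy model, not Graylog's implementation: one check per interval, one alert per condition.
def check_condition(messages, field, value):
    """Return a single alert (with a backlog) if any message in the window matches."""
    backlog = [m for m in messages if m.get(field) == value]
    if not backlog:
        return None
    return {"condition": f"{field}={value}", "backlog": backlog}


window = [
    {"source": "qa1", "level": 3, "message": "error A"},
    {"source": "qa2", "level": 3, "message": "error B"},
]

alert = check_condition(window, "level", 3)
print(alert["condition"], len(alert["backlog"]))  # level=3 2
```

Both sources matched, but the condition fired once; the second message only shows up in the backlog.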

People who want more granular alerting tend to have their own monitoring and/or alerting tool, and Graylog is just one part of the chain that is checked.


(Roger Mier) #4

Greetings Jan,

Thank you for the info. I was already aware of how it works, which is why I crafted the solution I have here, and I’m not trying to say there’s anything wrong with how it works TBH. My objection is simply to tracking alerts the way something like Nagios or Zabbix would, which is how I feel it’s implemented now. Alerts aren’t granular enough, log files aren’t usually stateful enough to indicate when an alert clears (even if they were, there’s a chance that the recovery condition would be missed due to how the alert search works), you can’t clear them manually, and you can’t easily silence a flapping error. These are all things I’ve more-or-less worked around with Pipelines.
And yes, we have other monitoring solutions in place for watching our platform and infra. In fact it might surprise you that at my employer we don’t really use Graylog so much for monitoring the state of our hardware, or even the state of our web platform. We use it, with the Slack plugin, primarily as instant bug feedback for our devs and QA as they build and test new features. Which is an area I think Graylog particularly excels in!
But if a level 3 error on qa1 happens at the same time as qa2, both QA testers need to get alerted in their respective Slack channels, so I’ve had to make extensive use of Pipelines to do a more granular sorting and alerting. (To the point now that I think there’s practically nothing I could do in the UI that I couldn’t do as well or better using Pipelines).

All that having been said, I love Graylog and use it every day. It’s been invaluable for tracking down bugs in our software, and watching our platform. So thanks for your hard work!


(william george) #5

That lines up with what I’ve seen. Do you happen to have any recommendations for an alerting dashboard? We’re trying out Alerta right now, but nearly everything else I’ve looked at seems to be tightly focused around SIEM use cases, rather than broader operations stuff.


(Jan Doberstein) #6

Your way of using Graylog is one of the most common, @Grakkal, or at least one classical use case.

@thegreattriscuit I know that you can integrate Graylog with Icinga very well and that you can use dashing or something like it to create a dashboard ( https://github.com/dnsmichi/dashing-icinga2 )

