Alerts seem to be slowing everything down?

So I’ve been tracking this issue off and on for quite a few months now, and I’ve finally started making some sense of what’s causing my random CPU spikes and slow log message processing. It seems to be the HTTP Alarm Callbacks we recently started using to feed alerts out to a more friendly “Incidents” web page. At the moment we have 11 HTTP callbacks, and that seems to be about the limit where they start causing messages to back up in the processing queue. These are the timer metrics that stand out, and they certainly seem incredibly high to me:

org.graylog2.rest.resources.streams.alerts.StreamAlertResource.list
Timer

95th percentile: 101,443μs
98th percentile: 101,443μs
99th percentile: 101,443μs
Standard deviation: 1,038μs
Mean: 100,419μs
Minimum: 99,163μs
Maximum: 101,443μs

org.graylog2.rest.resources.alerts.AlertResource.listPaginated
Timer
95th percentile: 3,010,943μs
98th percentile: 3,010,943μs
99th percentile: 3,010,943μs
Standard deviation: 0μs
Mean: 3,010,943μs
Minimum: 3,010,943μs
Maximum: 3,010,943μs

org.graylog2.rest.resources.alerts.AlertResource.listRecent
Timer
95th percentile: 29,635μs
98th percentile: 29,635μs
99th percentile: 29,635μs
Standard deviation: 0μs
Mean: 29,635μs
Minimum: 28,915μs
Maximum: 39,835μs

org.graylog2.rest.resources.streams.alerts.AlertConditionsResource.all
Timer
95th percentile: 5,170,597μs
98th percentile: 5,170,597μs
99th percentile: 5,170,597μs
Standard deviation: 0μs
Mean: 5,170,597μs
Minimum: 5,170,597μs
Maximum: 5,170,597μs

org.graylog2.rest.resources.streams.alerts.AlertConditionsResource.available
Timer
95th percentile: 18,921μs
98th percentile: 18,921μs
99th percentile: 18,921μs
Standard deviation: 0μs
Mean: 18,921μs
Minimum: 18,921μs
Maximum: 18,921μs

So, I can see these metrics, but I don’t really know what to do about them. Like I said, we’re using the callbacks to POST incidents to a web page for our engineers. Is there a better way to achieve this than Alert Callbacks? Are HTTP Alert Callbacks really this slow to process? We do use a lot of alerts to feed our Slack channels and to email us about serious incidents, so maybe this has been building for a while and I just didn’t notice until the server started falling behind.
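One idea I’ve been toying with is having the incidents page pull alerts from the REST API instead of us pushing them out with 11 separate callbacks. Roughly something like this, a completely untested sketch where the URL, credentials, and stream ID are placeholders, and where I’m assuming the GET /streams/{streamId}/alerts endpoint that backs the StreamAlertResource.list timer above:

    # Untested sketch: poll recent alerts for one stream instead of pushing
    # them with HTTP callbacks. URL, credentials, and stream ID are placeholders.
    import time
    import requests

    GRAYLOG = "http://graylog.example.com:9000/api"    # placeholder base URL
    AUTH = ("api-user", "password")                     # placeholder credentials
    STREAM_ID = "000000000000000000000001"              # placeholder stream ID

    def recent_alerts(since_epoch_seconds, limit=100):
        # Assumed endpoint: GET /streams/{streamId}/alerts?since=...&limit=...
        resp = requests.get(
            f"{GRAYLOG}/streams/{STREAM_ID}/alerts",
            params={"since": since_epoch_seconds, "limit": limit},
            auth=AUTH,
            headers={"Accept": "application/json"},
            timeout=10,
        )
        resp.raise_for_status()
        return resp.json().get("alerts", [])

    if __name__ == "__main__":
        # Poll once a minute and hand any new alerts to the incidents page.
        last_check = int(time.time()) - 60
        while True:
            for alert in recent_alerts(last_check):
                print(alert.get("id"), alert.get("description"))
            last_check = int(time.time())
            time.sleep(60)

Would something like that scale better than 11 callbacks, or would I just be moving the load from the callbacks onto the REST API?
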
I know this isn’t the right place for a feature suggestion, but the Metrics page desperately needs an ‘Expand all’ button. 1,453 metrics are way too many to open one at a time when you don’t know what might be causing the slowdown.
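In the meantime, one workaround I’m considering is pulling the metrics over the REST API and sorting them myself, roughly like the untested sketch below. I’m assuming the node returns Dropwizard-style JSON with a “timers” section; the URL and credentials are placeholders, and the field names may differ between versions:

    # Rough sketch: dump all timers via the REST API and list the slowest by
    # reported mean, instead of expanding 1,453 metrics one at a time in the UI.
    # URL and credentials are placeholders; the JSON layout is assumed to be a
    # Dropwizard-style registry with a "timers" section.
    import requests

    GRAYLOG = "http://graylog.example.com:9000/api"    # placeholder base URL
    AUTH = ("api-user", "password")                     # placeholder credentials

    resp = requests.get(
        f"{GRAYLOG}/system/metrics",
        auth=AUTH,
        headers={"Accept": "application/json"},
        timeout=30,
    )
    resp.raise_for_status()
    timers = resp.json().get("timers", {})

    # Sort by reported mean (units as reported by the API) and show the top ten.
    slowest = sorted(timers.items(), key=lambda kv: kv[1].get("mean", 0), reverse=True)
    for name, data in slowest[:10]:
        print(f"{data.get('mean', 0):>15,.0f}  {name}")
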

Thanks in advance!

How many processors do you have, and how are they allocated? This is just a guess, but adding a CPU or two, or changing how many threads handle the process buffer and output buffer, might help. I’ve also had some luck adjusting the Java heap size to make the system run a bit smoother. Again, all guesses, but they may help.
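To be concrete, the knobs I’m talking about live in server.conf and in the service defaults file. The paths and numbers below are just examples from a typical package install, not recommendations for your box:

    # /etc/graylog/server/server.conf -- example values only
    processbuffer_processors = 5
    outputbuffer_processors = 3
    inputbuffer_processors = 2

    # /etc/default/graylog-server (Debian/Ubuntu) -- heap portion only; keep
    # whatever GC flags your install already sets in this variable.
    GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g"

After changing either file you’ll need to restart graylog-server for it to take effect.
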

g’luck

Currently 16 vCPUs and 32 GB of RAM. I’ve ramped this box up to its current specs from the original 2 vCPUs and 4 GB precisely to combat this issue. The upgrades absolutely helped, but they didn’t eliminate the problem. Don’t get me wrong, the server does eventually catch up now, but only after hours, once things are quiet. Hardware makes a difference, but there has to be something else at work here. I guess it’s time to purge the HTTP alerts and see how that improves processing time. I just wish there were a way to disable the alerts rather than having to delete and re-create them.
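Before I purge them I’ll probably dump the callback configurations via the API so I can re-create them later. Something like this rough sketch; the URL and credentials are placeholders, and I haven’t double-checked the exact endpoints and response keys against my version:

    # Rough sketch: save every stream's alarm callback configuration to a JSON
    # file before deleting the callbacks, so they can be re-created later.
    # URL and credentials are placeholders; response keys are assumptions.
    import json
    import requests

    GRAYLOG = "http://graylog.example.com:9000/api"    # placeholder base URL
    AUTH = ("api-user", "password")                     # placeholder credentials
    HEADERS = {"Accept": "application/json"}

    resp = requests.get(f"{GRAYLOG}/streams", auth=AUTH, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    streams = resp.json().get("streams", [])

    backup = {}
    for stream in streams:
        # Assumed endpoint: GET /streams/{streamId}/alarmcallbacks lists the
        # callbacks (HTTP, Slack, email, ...) attached to that stream.
        cb = requests.get(
            f"{GRAYLOG}/streams/{stream['id']}/alarmcallbacks",
            auth=AUTH,
            headers=HEADERS,
            timeout=30,
        )
        cb.raise_for_status()
        backup[stream["id"]] = cb.json()

    with open("alarmcallbacks-backup.json", "w") as fh:
        json.dump(backup, fh, indent=2)
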
Thanks!

I have no insight into the HTTP alerts, but I was having a similar issue with message processing. While my timer metrics weren’t as large as yours, I was able to help my system by tuning the Java heap and how I allocated my processors in the server.conf file.

hth
