I’ve been tracking this issue off and on for several months now, and I’ve finally started to make sense of what’s causing my random CPU spikes and slow log message processing. It appears to be the HTTP Alarm Callbacks we recently started using to feed alerts out to a friendlier “Incidents” web page. We currently have 11 HTTP Callbacks, and that seems to be about the limit before messages start backing up in the Processing queue. The metrics below certainly look incredibly high to me:
| Metric (all Timers, values in μs) | Mean | 95th/98th/99th pct | Std dev | Min | Max |
|---|---:|---:|---:|---:|---:|
| org.graylog2.rest.resources.streams.alerts.StreamAlertResource.list | 100,419 | 101,443 | 1,038 | 99,163 | 101,443 |
| org.graylog2.rest.resources.alerts.AlertResource.listPaginated | 3,010,943 | 3,010,943 | 0 | 3,010,943 | 3,010,943 |
| org.graylog2.rest.resources.alerts.AlertResource.listRecent | 29,635 | 29,635 | 0 | 28,915 | 39,835 |
| org.graylog2.rest.resources.streams.alerts.AlertConditionsResource.all | 5,170,597 | 5,170,597 | 0 | 5,170,597 | 5,170,597 |
| org.graylog2.rest.resources.streams.alerts.AlertConditionsResource.available | 18,921 | 18,921 | 0 | 18,921 | 18,921 |
So I can see these metrics, but I don’t really know what to do about them. AlertConditionsResource.all is averaging over 5 seconds per call, and AlertResource.listPaginated over 3 seconds. As I said, we’re using the callbacks to POST incidents to a web page for our engineers. Is there a better way to achieve this than Alert Callbacks? Are HTTP Alert Callbacks really this slow to process? We also use a lot of Alerts to feed our Slack channels and to email us about serious incidents, so maybe this load has been building for a while and I only noticed once the node could no longer keep up.
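One alternative I’ve been toying with is flipping the flow around: instead of 11 push callbacks firing on every alert, have the Incidents page’s backend poll the alerts API on its own schedule. A rough sketch of what I mean (host, credentials, and the response field names are placeholders, and I’m assuming a paginated GET /alerts endpoint like the AlertResource.listPaginated resource above seems to correspond to — check your API browser for the exact path and parameters):

```python
# Sketch: poll Graylog's alerts API instead of pushing via HTTP callbacks.
# Assumptions (not verified): basic-auth API access, a paginated GET /alerts
# endpoint with skip/limit parameters, and a response shaped like
# {"alerts": [{"id": ..., "description": ...}, ...]}.
import time
import requests

GRAYLOG_API = "https://graylog.example.com/api"  # placeholder host
AUTH = ("incidents-reader", "s3cr3t")            # placeholder credentials

seen_ids = set()  # naive de-duplication across polls

def fetch_recent_alerts(limit=50):
    """Pull the most recent alerts; returns a list of alert dicts."""
    resp = requests.get(
        f"{GRAYLOG_API}/alerts",
        params={"skip": 0, "limit": limit},
        auth=AUTH,
        headers={"Accept": "application/json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("alerts", [])

while True:
    for alert in fetch_recent_alerts():
        if alert["id"] not in seen_ids:
            seen_ids.add(alert["id"])
            # hand off to the Incidents page's backend here
            print(alert.get("description", alert["id"]))
    time.sleep(30)  # one read per interval instead of 11 callbacks per alert
```

That would turn N callbacks per alert into one read per poll interval, though given listPaginated’s 3-second mean above, I’m not sure the read side is any cheaper.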
I know this isn’t the right place for a feature suggestion, but the Metrics page desperately needs an ‘Expand all’ button. 1,453 metrics are far too many to open one at a time when you don’t know what might be causing the slowdown.
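In the meantime, the workaround I’ve ended up with is pulling the metrics over the REST API and filtering them in a script rather than expanding widgets one by one. A sketch of that (again, host and credentials are placeholders, and I’m assuming a GET /system/metrics endpoint that returns the registry as JSON — verify the path and response shape in your own API browser):

```python
# Sketch: dump timer metrics for the alert resources via the REST API
# instead of expanding them one by one in the UI. Assumes (not verified)
# a GET /system/metrics endpoint returning the metric registry as JSON
# with a top-level "timers" map keyed by metric name.
import json
import requests

GRAYLOG_API = "https://graylog.example.com/api"  # placeholder host
AUTH = ("admin", "s3cr3t")                       # placeholder credentials

resp = requests.get(
    f"{GRAYLOG_API}/system/metrics",
    auth=AUTH,
    headers={"Accept": "application/json"},
    timeout=10,
)
resp.raise_for_status()

timers = resp.json().get("timers", {})
for name, values in sorted(timers.items()):
    if ".alerts." in name:  # only the alert-related resources above
        print(name)
        print(json.dumps(values, indent=2))
```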
Thanks in advance!