We have recently centralised all of our application logs using Graylog, and it’s been working well.
We now want to collect system metrics (CPU, disk, network data) from our machines. As these are metrics and not logs, I don’t understand how to send them to Graylog. I guess each metric would be a GELF log message, which doesn’t make a lot of sense to me.
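To make the "each metric is a GELF log" idea concrete, here is a rough Python sketch of what a single sample wrapped as a GELF 1.1 message could look like. The helper names, the `_metric_name` / `_metric_value` fields and the example host are all made up for illustration; only the top-level GELF fields (`version`, `host`, `short_message`, `timestamp`, `level`) and the underscore prefix for custom fields come from the GELF format itself. Port 12201 is just the conventional GELF UDP port and assumes a GELF UDP input is listening there.

```python
import json
import socket
import time


def build_gelf_metric(host, name, value, timestamp=None):
    """Wrap one metric sample in a GELF 1.1 message (a plain dict).

    The "_metric_*" field names are hypothetical; GELF only requires that
    custom fields start with an underscore.
    """
    return {
        "version": "1.1",
        "host": host,
        "short_message": f"{name}={value}",
        "timestamp": time.time() if timestamp is None else timestamp,
        "level": 6,  # syslog "informational"
        "_metric_name": name,
        "_metric_value": value,
    }


def send_gelf_udp(message, server, port=12201):
    """Send one uncompressed GELF message as a single UDP datagram.

    Assumes a GELF UDP input is running on server:port.
    """
    payload = json.dumps(message).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (server, port))


if __name__ == "__main__":
    # Build (but do not send) one sample and show the resulting JSON.
    sample = build_gelf_metric("web01", "cpu_percent", 37.5)
    print(json.dumps(sample, indent=2))
```

Every sample becomes a full log message with its own metadata, which is part of why this feels like an awkward fit for metrics.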
I think we should use another tool for metrics, for example Prometheus and Grafana.
Am I correct to want to use something other than Graylog for metric dashboarding and alerting? Or should we use Graylog for metrics too? The company I work for probably won’t like that I am asking for ANOTHER tool, so I need to justify it.
You could get metrics into Graylog / Elasticsearch using one of the Beats, but you will have loads of work making things visible to users and/or management, and it will use a lot of storage on your ES nodes, especially if you need to go back at least one year (and you will want that).
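As a back-of-the-envelope illustration of the storage point, the sketch below counts how many documents a year of metrics would produce. All the numbers are assumptions picked for the example (50 hosts, 20 metrics per host, one sample per minute, roughly 200 bytes per stored document), and it assumes the worst case of one document per metric sample; a shipper that bundles several values into one event would produce fewer documents.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def documents_per_year(hosts, metrics_per_host, interval_seconds):
    """Number of stored documents per year, one document per metric sample."""
    samples_per_series = SECONDS_PER_YEAR // interval_seconds
    return hosts * metrics_per_host * samples_per_series


def estimated_bytes(documents, bytes_per_document):
    """Very rough storage estimate; ignores replicas and index overhead."""
    return documents * bytes_per_document


if __name__ == "__main__":
    # Hypothetical fleet: 50 hosts, 20 metrics each, sampled every 60 s.
    docs = documents_per_year(hosts=50, metrics_per_host=20, interval_seconds=60)
    print(docs)  # 525600000

    # 200 bytes per document is an invented figure, not a measurement.
    print(estimated_bytes(docs, 200) / 1e9)  # 105.12 (GB)
```

Dropping the interval to 10 seconds multiplies both figures by six, so the retention period and sampling rate matter a lot here.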
Simple to use and set up for monitoring is check_mk: it has a large user base, broad support for different and modern systems, and is easy to install, update and maintain through a single deb or rpm. The downside of the free open version is that the shortest monitoring interval is one minute.
What we do is send the events output to Graylog, keep a simple overview of the events there, and escalate through Graylog alerting.
As @Karlis and @Arie said, Graylog is not the most efficient tool for capturing metrics; Zabbix (or in my case LibreNMS) is much better for my needs. But if you want to stick with Graylog, you can use Elastic’s Metricbeat to ship in metrics for tracking. You can pull those metrics directly from the Elasticsearch/OpenSearch(?) database with Grafana if you want to display them via that route.
Graylog is for logs, not for monitoring system properties. Monitoring usually involves low bandwidth, and your primary interest is whether the current value is right or wrong.
Logging tends to involve much higher bandwidth, and you might have buffers filling up and being processed later. As a result, you would find out that something is wrong way too late.
I like check_mk quite a lot, but that is only something personal.
For my home lab setup, I used Elastic’s Metricbeat to get processor/disk/memory metrics from a couple of machines into Graylog. All I had to do was set up a Beats input to catch the data and I was pretty quickly able to get some dashboards up and running to monitor host health.