A bit of background: We’re a small-ish ISP/MSP specializing in the education sector. We use Graylog to aggregate statistics from a wide variety of our services, such as email spam filtering, security tools, web hosting, etc. We seem to be pushing Graylog to its limits. We’re sending it about 5k messages per second, which it ingests fine, but I have to be very careful if I run a search query, as it can cause the system to tip over. Ingestion hangs: it still receives the messages fine, but it can’t seem to index them, and outgoing messages drop to 0 per second. Sometimes it will recover on its own, in time; sometimes I have to restart the server and hope it is able to come back up before my journal fills.
But this is not a question about performance. After much fiddling, research, reading, and experimentation, I have somewhat come to terms with the idea that during periods of high activity, I can’t expect to reliably run searches (though if anyone has any ideas or suggestions, I am open to hearing them). The problem is that we originally wanted to open up the graylog dashboards to our clients, to let them see, in real time, how their services were doing and what sort of value we were providing them. Suffice it to say, that’s not really an option given that I can’t trust graylog to stay up even when it is just me verrrry caaaarrrefully running searches or loading dashboards - start letting clients run searches or open dashboards all willy-nilly, and it is sure to die.
What I’m wondering is whether there is a way to export dashboards statically, or have them fronted by another service. The idea is that during periods of low activity, like at night, we run the necessary searches to produce the dashboards for yesterday’s statistics, which we can then present to clients. We give up the ability to do it in real time, but I think we can live without that.
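To make the idea concrete, here is a rough sketch of the kind of nightly export I have in mind, using the standard Graylog 2.x REST search endpoint (/search/universal/absolute). The host, credentials, stream ID, query, and output path are all placeholders, so treat it as an illustration rather than a working script:

```python
# nightly_export.py - run from cron during a quiet window. Pulls yesterday's
# totals for one client via the Graylog REST API and writes them to a static
# JSON file that a simple client-facing page can render. Host, credentials,
# stream ID, query, and output path below are placeholders.
import datetime
import json

import requests

GRAYLOG_API = "http://graylog.example.com:9000/api"  # placeholder
AUTH = ("client-stats", "secret")                    # placeholder service account


def export_yesterday(query, stream_id, out_path):
    today = datetime.date.today()
    yesterday = today - datetime.timedelta(days=1)
    params = {
        "query": query,
        "from": yesterday.strftime("%Y-%m-%dT00:00:00.000Z"),
        "to": today.strftime("%Y-%m-%dT00:00:00.000Z"),
        "filter": "streams:%s" % stream_id,
        "limit": 1,  # we only need the total count, not the messages themselves
    }
    r = requests.get(GRAYLOG_API + "/search/universal/absolute",
                     params=params, auth=AUTH,
                     headers={"Accept": "application/json"})
    r.raise_for_status()
    result = r.json()
    with open(out_path, "w") as f:
        json.dump({"query": query, "total": result.get("total_results")}, f)


if __name__ == "__main__":
    # hypothetical stream and query for one client's spam-filter statistics
    export_yesterday("action:rejected", "CLIENT_A_STREAM_ID", "/var/www/stats/client-a.json")
```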
Running Kibana could be an alternative dashboarding option, but I’m not sure what the functional outcome of that would actually be.
In terms of the performance issues you’re seeing, it might be worth separating out the roles a bit - maybe keep elasticsearch on separate nodes and hardware if possible. If you could describe your environment in more detail, there might be more suggestions on how to optimise it.
So I’ve gone through a couple of architecture iterations since I started out. Originally, I had everything on one server, but that wasn’t scaling well, so I rebuilt it all and separated the graylog server from the elasticsearch cluster. That improved stability a little, but considering how bad it was to begin with, that’s not saying much. I tried adding a second elasticsearch node to the cluster, but that also had little impact. The (virtual) boxes seem well provisioned: CPU usage and load averages are quite reasonable. Each machine has 6 cores, with 10GB of RAM on the graylog-server and 40GB on each of the elasticsearch nodes. None are using substantial amounts of swap space. IOPS are a little harder to measure, but our storage infrastructure guys insist that while graylog is the single biggest consumer of IOPS, there are still plenty to spare. The graylog-server is backed by SSDs, about 100GB, while the ES nodes are on large RAID arrays, each consuming about 8TB currently. I’ve tried adding both RAM and CPU to the nodes, but have not noticed an improvement in stability or performance after doing so.
I’ve heard of Kibana, but have never used it. So long as I can query once and cache the results, it might actually suit my purposes. I’ve never thought to have two systems accessing the same elasticsearch cluster, but why not?
You obviously know your stuff and your infrastructure seems pretty well laid out, so I have to say I’m surprised you’re getting such poor performance given the layout. I’m still at the single-server stage on my current deployment (I have done larger deployments at other companies) and am ingesting 20GB+ a day without too much difficulty; searches are quick and dashboards display in a timely fashion. I have encountered a bug where the graphs of certain value counts don’t work correctly, which I spotted in the logs. I’m hopeful that the issue I faced was just due to starting out on a beta version.
It’s worth checking your logs when it happens to watch for exceptions, if you haven’t already, and posting your results. What version are you on, out of interest?
We had some of the same trouble you are talking about, and have solved most of it over the past few months. It sounds like you have enough hardware to cover the messages coming in.
Things to think about:
How much HEAP is configured for each Elasticsearch node?
How much HEAP is configured for Graylog?
Do you have the disk journal enabled for Graylog? We did this on SSD, and it made a big difference.
If you look in System/Nodes -> Nodes and view details for the node in question, are any buffers full? Input, Processing, or Output?
In terms of iops on your ES nodes, what is your %iowait? iostat or top would show you this (quick check sketched just below).
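If it helps, here’s a quick way to eyeball a couple of those numbers. The _cat/nodes and _cluster/health endpoints are standard Elasticsearch APIs; the host is a placeholder, and psutil’s iowait figure is Linux-only:

```python
# Quick health check for the questions above: ES heap pressure, cluster state,
# and local %iowait. Run the iowait part on each ES node itself.
import psutil    # third-party; reports CPU times including iowait on Linux
import requests

ES = "http://es-node1.example.com:9200"  # placeholder

# Heap: sustained heap.percent above ~75% usually means the node is spending
# a lot of its time in garbage collection.
nodes = requests.get(ES + "/_cat/nodes",
                     params={"h": "name,heap.percent,heap.max", "format": "json"}).json()
for n in nodes:
    print("%s  heap %s%% of %s" % (n["name"], n["heap.percent"], n["heap.max"]))

# Cluster state: anything other than "green" during an incident is worth noting.
print("cluster:", requests.get(ES + "/_cluster/health").json()["status"])

# %iowait on this box, sampled over one second.
print("iowait: %.1f%%" % psutil.cpu_times_percent(interval=1).iowait)
```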
We’re on version 2.3.2 and are pushing just under 100GB/day. I haven’t noticed any exceptions in the logs, but I’ll go through them again.
Dustin, I’m not sure what you mean by HEAP. Are you referring to the memory given to the JVM? For each node, I’ve given the JVM half of the available RAM, as per the documentation’s recommendations; 5GB on the graylog-server and 20GB on each of the ES nodes.
Yes, we have a 15GB journal, backed by SSDs. It used to be 5GB on a spinning-rust RAID. Moving it did not seem to make a noticeable difference; the maximum ingestion rate may have been slightly improved.
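Since the journal keeps coming up, here is a rough sketch of how I could keep an eye on it during an incident. It assumes the Graylog 2.x REST API exposes journal status at GET /system/journal (the path and the response field names are assumptions on my part - worth confirming in the API browser); the host and credentials are placeholders:

```python
# Sketch: poll journal status so you can see whether the journal is filling
# because the output (Elasticsearch) side has stalled. The /system/journal
# path and the field names below are assumptions - confirm in the API browser.
import time

import requests

GRAYLOG_API = "http://graylog.example.com:9000/api"  # placeholder
AUTH = ("admin", "secret")                           # placeholder

while True:
    j = requests.get(GRAYLOG_API + "/system/journal", auth=AUTH,
                     headers={"Accept": "application/json"}).json()
    print("uncommitted entries: %s, journal size: %s bytes"
          % (j.get("uncommitted_journal_entries"), j.get("journal_size")))
    time.sleep(10)
```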
Under normal operation, the buffers are basically empty. If I try to perform a search, they will start to fill until graylog can ‘recover’. Sometimes that requires my manually restarting the server. Since I added a second elasticsearch node, I sometimes have to also force shard re-allocation, though I suspect that only happens if and when I restart the server.
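For completeness, when I say ‘force shard re-allocation’ I mean something along these lines - re-enabling allocation through Elasticsearch’s standard _cluster/settings API and waiting for the cluster to go green (the host below is a placeholder):

```python
# Rough sketch: after restarting, re-enable shard allocation and wait for the
# cluster to report green. Standard Elasticsearch APIs; host is a placeholder.
import time

import requests

ES = "http://es-node1.example.com:9200"  # placeholder

# Re-enable allocation in case it was disabled or got stuck during the restart.
requests.put(ES + "/_cluster/settings",
             json={"transient": {"cluster.routing.allocation.enable": "all"}})

# Poll until all shards are assigned again.
while True:
    health = requests.get(ES + "/_cluster/health").json()
    print(health["status"], "- unassigned shards:", health["unassigned_shards"])
    if health["status"] == "green":
        break
    time.sleep(30)
```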
It sounds like the Graylog portion of the stack is healthy (to me anyway).
Paying attention to just Elasticsearch:
In the past we have changed the index.refresh_interval from the default of 1 second up to 60 seconds to good effect, but after switching to the Hot/Warm architecture for ES I think we are back to the default. I still think ES is the place to troubleshoot.
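For reference, the change itself is just an index-settings update. The sketch below assumes the default ‘graylog’ index prefix and a placeholder host; note that newly rotated indices fall back to whatever the index template says, so a custom template is needed to make the change stick:

```python
# Sketch: raise refresh_interval on the existing Graylog indices. /_settings is
# a standard Elasticsearch API; "graylog_*" assumes the default index prefix
# and the host is a placeholder.
import requests

ES = "http://es-node1.example.com:9200"  # placeholder

r = requests.put(ES + "/graylog_*/_settings",
                 json={"index": {"refresh_interval": "60s"}})
print(r.json())  # expect {"acknowledged": true}
```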
Let me get a fresh map of our current configuration and I will post back to the group.
I tried setting the index.refresh_interval to 60s as suggested above, but it only resulted in maybe a slight improvement to the maximum ingestion rate.
I should also mention that I kind-of-sort-of-not-really fixed my original problem with static dashboards. By using the timeframe keyword ‘00:00 Yesterday to 23:59:59 Yesterday’, I can leverage the search cache built into Elasticsearch, as every search (for that dashboard) is now exactly the same. The problem still remains the first time I load the dashboard for the day’s data, so I either need a way to ‘preload’ the dashboard’s searches into the cache during periods of low activity, or fix the problem itself.
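To make the ‘preload’ idea concrete, something like the sketch below could run from cron in the early morning, replaying the same keyword searches the dashboard issues so that the first real page load hits a warm cache. It assumes the Graylog 2.x keyword-search endpoint (/search/universal/keyword); the host, credentials, and query list are placeholders standing in for whatever the dashboard widgets actually run:

```python
# Sketch of the "preload" idea: replay the dashboard's searches during a quiet
# window so the first real page load hits warm caches. Host, credentials, and
# the query list are placeholders for whatever the dashboard widgets use.
import requests

GRAYLOG_API = "http://graylog.example.com:9000/api"  # placeholder
AUTH = ("dashboard-warmer", "secret")                # placeholder service account
KEYWORD = "00:00 Yesterday to 23:59:59 Yesterday"    # same keyword the dashboard uses

QUERIES = [
    "source:spamfilter AND action:rejected",          # placeholder widget query
    "source:webhosting AND http_response_code:>=500", # placeholder widget query
]

for q in QUERIES:
    r = requests.get(GRAYLOG_API + "/search/universal/keyword",
                     params={"query": q, "keyword": KEYWORD, "limit": 1},
                     auth=AUTH, headers={"Accept": "application/json"})
    r.raise_for_status()
    print(q, "->", r.json().get("total_results"), "messages")
```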