How does the activity of my Graylog cluster (both ingestion of logs and searches) impact heap size utilization on my ES nodes?
I have 3 Graylog nodes and 4 ES nodes. The ES nodes each have 64GB of physical RAM, with ES_HEAP_SIZE set to 30g. I am using Kopf and ElasticHQ to keep an eye on things, and all 4 ES nodes show heap usage at 85-95% right now. One day last week, while in a similar state, my 3 Graylog nodes stopped sending messages to Elasticsearch altogether for a period. I restarted my ES nodes in a rolling fashion to get things moving again. Message ingestion rate is about average for our setup today, 3500-5000 per second. CPU utilization on the ES nodes is 10-15%, and the 1-minute load average is 2.5-5. (The ES nodes have 2x E5-2620v3, so 12 cores / 24 threads per system.)
Is the % heap used more an indication of searches performed, or an indication that, despite having more than ample disk space, I've outgrown 4 ES nodes?
I figured I’d post the question here rather than directly to support so others can benefit from the answer and discussion.
Without looking into the log files of Graylog and Elasticsearch it's hard to find the reason for that, but it might be related to how the system is being used. Perhaps some search queries just knocked out Elasticsearch and sent it into some kind of meditation.
I appreciate that you want the community to benefit from the findings, but I guess you do not want to share your logs publicly … so you can open a support ticket containing your logs and we will take a look at it.
Circling back on this post: I discovered that one of the largest consumers of heap on our Elasticsearch nodes was the sheer amount of data we were retaining. My initial index rotation and retention in Graylog was 1 day, keeping 365 indices. At peak we reached about 42 billion messages across 4 ES nodes, with about 16TB of data per node. When we started out with Graylog early last year we spec'd data nodes with about 30TB of data space each, and, not knowing how Elasticsearch ran, I "threw a dart at a board" and set retention to one year.

A few months back we chose to throttle retention back from 12 months to 6, deleting indices older than 6 months. Even so, earlier in the week heap usage on our nodes was still running around 90%. This time I cut our retention down to only 90 days, but rather than delete the older indices I close them, so that data is still available if we decide we need it at a later date. My average heap usage dropped from 85-90% to 65-70%.
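For anyone wanting to do something similar, here's a minimal sketch of the "close instead of delete" approach. The index names, dates, and selection logic are hypothetical; the actual close operation is Elasticsearch's `POST /<index>/_close` API, shown in the comments rather than executed.

```python
from datetime import date, timedelta

def indices_to_close(index_dates, retention_days, today):
    """Return names of indices older than the retention window.

    index_dates: dict mapping index name -> the date that index covers.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(name for name, d in index_dates.items() if d < cutoff)

# Hypothetical daily indices (one index per day, as with 1-day rotation).
indices = {
    "graylog_2017-01-01": date(2017, 1, 1),
    "graylog_2017-03-01": date(2017, 3, 1),
    "graylog_2017-05-01": date(2017, 5, 1),
}

old = indices_to_close(indices, retention_days=90, today=date(2017, 5, 15))
print(old)  # -> ['graylog_2017-01-01']

# Each of these would then be closed (not deleted), e.g.:
#   curl -XPOST 'http://es-node:9200/graylog_2017-01-01/_close'
# A closed index frees its heap but stays on disk and can be reopened later.
```

Closing keeps the segments on disk, so the trade-off versus deleting is purely disk space for the option of reopening later.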
In the past few months I’ve heard recommendations of having no more than 4TB of data on any given Elasticsearch node, and also recommendations (from an Elasticsearch engineer who came onsite to assist with sizing a non-Graylog cluster) as low as only 1TB of data per node.
I didn’t know that each shard for each index consumes a small amount of heap, even if no queries have been run against them.
I'm still planning on moving our Elasticsearch cluster to a hot-warm architecture this summer, with my hot nodes using all SSD (and 128GB of RAM: 31GB for heap and the remainder for filesystem cache). I plan on keeping no more than 30 days of data on my hot nodes, but may choose to drop down to only 14, moving older indices to my warm nodes, and possibly even closing indices older than a certain point. If one of our data scientists needs to run historical queries against much older data, I can bring those indices back online when needed.
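For reference, hot-warm is typically done with a custom node attribute plus index-level allocation filtering. The attribute name `box_type` is just a common convention (set per node in `elasticsearch.yml`), and this helper is my own sketch of the decision, not anything Graylog does for you:

```python
# Sketch of the index settings used to pin an index to a tier in a
# hot/warm setup. Assumes each ES node is started with a custom
# attribute such as box_type: hot or box_type: warm (the exact node
# setting key varies by Elasticsearch version).

def allocation_settings(index_age_days, hot_days=30):
    """Return allocation-filtering settings for an index of a given age."""
    tier = "hot" if index_age_days <= hot_days else "warm"
    return {"index.routing.allocation.require.box_type": tier}

print(allocation_settings(7))   # keeps a fresh index on hot nodes
print(allocation_settings(45))  # shifts an older index to warm nodes
# Applied with: PUT /<index>/_settings  (body = the dict above as JSON);
# Elasticsearch then relocates the shards to matching nodes on its own.
```

The nice part is that moving an index between tiers is just a settings update; the cluster handles the shard relocation.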
This is interesting - the data per node information, I mean.
Do you have any pointers?
I'm still learning as I go, and unfortunately I think the answer to your question is the dreaded "it depends". Our Graylog setup is like a small version of what Loggly does, in that we're promoting a "come one, come all" philosophy. So I'm receiving network device logs from the network team, web and DB server metrics from another team, and application-specific logs from half a dozen teams. I have not yet broken them into multiple index sets because we don't have a good answer yet for how long a given set of data needs to be kept.

Some folks are content to receive alerts based on given conditions from their data, others prefer dashboards, while others yet have requested access to their data via Kibana. The latter I'm allowing very sparingly because, with the current way Graylog connects to Elasticsearch, I am not using Shield and cannot limit Kibana queries based on role, etc. I have Kibana behind an Nginx reverse proxy using auth_basic, with iptables restricting access to the HTTP API across my cluster except from within the cluster itself.
What I can say is that the articles below helped me understand a few things I could do to improve resource usage on our Graylog/ES clusters.
They were very helpful pointers, thanks!
This is interesting (but perhaps worth another topic): have you found any written recommendations for this?