How does the activity of my Graylog cluster (both ingestion of logs and searches) impact heap size utilization on my ES nodes?
I have 3 Graylog nodes and 4 ES nodes. The ES nodes each have 64GB of physical RAM, with ES_HEAP_SIZE set to 30g. I am using Kopf and ElasticHQ to keep an eye on things, and all 4 ES nodes currently show heap usage at 85-95%. One day last week, when the cluster was in a similar state, my 3 Graylog nodes stopped sending messages to Elasticsearch altogether for a period. I restarted my ES nodes in a rolling fashion to get things moving again. Message ingestion rate is about average today for our setup, 3500-5000 per second. On the ES nodes, processor utilization is 10-15% and the 1-minute load average is 2.5-5. (Each ES node has 2x E5-2620v3, so 12 cores / 24 threads per system.)
Is the % heap used more an indication of searches performed, or an indication that, despite having more than ample disk space, I've outgrown 4 ES nodes?
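In case it helps anyone narrow this down, here is a rough sketch of how the heap numbers could be pulled straight from the Elasticsearch node stats API instead of reading them off Kopf or ElasticHQ. It assumes the stats response contains `jvm.mem.heap_used_percent`, `indices.fielddata.memory_size_in_bytes` and `indices.segments.memory_in_bytes` for each node; those field names can differ between ES versions, so missing values are treated as absent rather than as errors. The function name `heap_breakdown`, the helper `fetch_stats` and the `http://localhost:9200` address are placeholders of mine, not anything from Graylog or ES.

```python
import json
from urllib.request import urlopen


def heap_breakdown(stats):
    """Summarize heap-related figures per node from a parsed node stats response."""
    rows = []
    for node in stats.get("nodes", {}).values():
        mem = node.get("jvm", {}).get("mem", {})
        indices = node.get("indices", {})
        rows.append({
            "name": node.get("name", "?"),
            # Overall heap usage as reported by the JVM section.
            "heap_used_percent": mem.get("heap_used_percent"),
            # Fielddata is loaded for sorting/aggregating, so it tends to follow search activity.
            "fielddata_bytes": indices.get("fielddata", {}).get("memory_size_in_bytes", 0),
            # Segment memory tends to follow how much indexed data each node holds.
            "segments_bytes": indices.get("segments", {}).get("memory_in_bytes", 0),
        })
    return sorted(rows, key=lambda r: r["name"])


def fetch_stats(base_url="http://localhost:9200"):
    """Fetch JVM and index stats for all nodes (base_url is an assumed address)."""
    with urlopen(base_url + "/_nodes/stats/jvm,indices") as resp:
        return json.load(resp)


# Example use against a live cluster:
#   for row in heap_breakdown(fetch_stats()):
#       print(row)
```

If fielddata accounts for most of the heap, that would point toward search and aggregation activity; if segment memory dominates, that would point toward the volume of data held per node. This is only a first-pass reading, not a definitive diagnosis.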
I figured I’d post the question here rather than directly to support so others can benefit from the answer and discussion.