Out of RAM


Thanks for the added info.

Ah good, I was like holy cow, all the replicas and data directories are on the same disk/HDD :laughing:

Something that caught my eye: the node es-node-01 in the screenshot is using 3 times more CPU than the other two nodes. At first I thought it might be the master node, but it’s not. Unsure if that was just a random metric or if it’s always like that?

To sum it up:

  • 3-node cluster with all three services on each node.
  • Each node has 16 cores and 32 GB of RAM.
  • Each Elasticsearch instance has 16 GB of RAM allocated.
  • Graylog/Java has a 3 GB heap allocated.
  • Log ingestion is 300-500 GB a day, about 3,000-5,000 messages per second.
  • 515 indices with 7,170 active shards (that’s a lot for a three-node cluster).
  • All logs are clean.
  • I’m assuming the Process/Output/Input buffers & Journal are good.

A couple of questions on this:

  • Was this gradual or an overnight issue?
  • What changed before this issue started?
  • Was it always 300-500 GB a day?
  • Any updates applied? Server rebooted?
  • Any plugins installed?
  • Do you have regex extractors, GROK patterns, or pipelines configured?

Certain Java versions have been known to increase memory usage. I was also curious about the amount of logs per day: if it was lower before (say 200-300 GB) and then increased, that could also have an impact on resources. A bad regex expression or a bad GROK pattern could likewise be a culprit for high memory usage.

You do have a large amount of data, judging from the messages per day and how many shards are being generated.

From what I’m seeing, this is a pretty normal amount of memory usage for having all three services on each node. This is why the documentation suggests that Elasticsearch should be on its own node with as much memory as possible; that way Graylog/MongoDB are not fighting over resources.
To be honest, I feel something was changed: either the log shippers are sending a lot more logs, or perhaps new configurations or updates were made. Have you been trending data on these cluster servers for the past month or two? If so, did you see anything that may pertain to this issue?

To give you an idea, here is my lab GL server with all three services on one node. It has 12 CPUs, 12 GB of memory, and a 500 GB drive, with 4 GB of RAM for Elasticsearch and 3 GB for the GL heap. This server is only ingesting 30 GB a day. No replicas, only 4 shards per index, and rotation is 1 day / deleted after 30 days.

As you can see I’m using about the same percentage of memory as you.

In the forum there have been issues with “over-sharding” tying up memory. Not sure if this pertains to your issue, but even if not, the posts below are a good read.

Hello again :upside_down_face:

I have configured each node as both a data and a master node (this can be seen in the configuration file I sent). If one of the nodes reboots or goes down, another node takes on the master role (the so-called Raft methodology, if memory serves).

And what settings do I need to change so that everything works nicely?

I don’t know; it’s just that one day when I checked the server the RAM had filled up, so it most likely happened at night.


Yes, always.

I rebooted only after I saw that the RAM was full, and restarting Elasticsearch and Graylog did not help.


Yes, but everything worked fine before that, so I don’t think that’s the problem. The service ran for 4 years before I switched it to a cluster architecture ~8 months ago.

And a question from me: what time zone are you in, and could we move from the forum to Telegram or somewhere else to work on this problem together online, if you don’t mind? When you write messages, it’s 4-5 AM for me.


I’m not very good at troubleshooting Regex/GROK patterns but @tmacgbay might be able to jump in here :smiley:

My time zone is Central (UTC-6). I don’t mind; as for communicating, how about Discord or Zoom? You can DM me here if you like.

This issue is starting to sound like bad regex extractors or GROK patterns, but I’m not 100% sure. Just a thought: it might have started with a burst of messages that broke something, and since it is affecting all the GL servers at once, I’m leaning towards the extractors right now…

Happy to take a look at what you have going on if you want to post the regex/GROK and an example message. If you search the forums for “GROK lock” there are a couple of posts I have put out there like this one.


Hi, I didn’t quite understand what I’m supposed to do. Do you need any additional data or not?

@gsmith is suggesting that during high-volume periods a regex or GROK statement gets overloaded and could possibly lock up or slow down your process buffers. In the link (and if you search the forum for more) you can find out more about that issue and where to view your process buffers while it is happening. If you think there may be a regex or GROK statement that is inefficient, you can post it here (along with an example message) and I am happy to take a look. I am by no means an expert, but I have dealt with process buffers locking up before.

One of the best ways to make GROK or regex more efficient is to anchor it to the beginning ^ or the end $ of the message; otherwise it will sift through the whole message attempting a match… and when you have thousands of messages processing, that can get very inefficient…
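As a tiny illustration of the anchoring point, here is a sketch using `grep -E` on a made-up log line (the message and the patterns are purely hypothetical, not taken from this cluster):

```shell
# A made-up log line, similar in shape to a syslog message.
msg='2022-03-24T13:53:15 host sshd[123]: Failed password for root from 10.0.0.5'

# Unanchored pattern: on a non-matching message the engine retries the
# match at every position in the string before giving up.
echo "$msg" | grep -Eq 'Accepted publickey' || echo 'no match (after scanning the whole line)'

# Anchored with ^: a message that does not start with a timestamp is
# rejected after a single attempt at position 0, which is much cheaper
# at thousands of messages per second.
echo "$msg" | grep -Eq '^[0-9]{4}-[0-9]{2}-[0-9]{2}T[^ ]+ [^ ]+ sshd' && echo 'match (anchored)'
```

The same idea applies inside a Graylog extractor or pipeline rule: the regex engine behaves the same way regardless of where it runs.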

As I understand it, this is what we are talking about now?

Here’s what it shows me where you asked me to click (I blurred out some confidential information, I hope that doesn’t hurt):

Or do you need a full message?

Even today in the Graylog logs I noticed the following entries:

2022-03-24T13:53:15.001Z ERROR [DecodingProcessor] Unable to decode raw message RawMessage{id=c0763e91-ab79-11ec-b760-001a0000144c, messageQueueId=28898782352, codec=gelf, payloadSize=1025, timestamp=2022-03-24T13:53:15.001Z, remoteAddress=/} on input <5f367557046ddce7db14e9a3>.
2022-03-24T13:53:15.001Z ERROR [DecodingProcessor] Error processing message RawMessage{id=c0763e91-ab79-11ec-b760-001a0000144c, messageQueueId=28898782352, codec=gelf, payloadSize=1025, timestamp=2022-03-24T13:53:15.001Z, remoteAddress=/}
java.io.EOFException: Unexpected end of ZLIB input stream
        at java.util.zip.InflaterInputStream.fill(InflaterInputStream.java:240) ~[?:1.8.0_302]
        at java.util.zip.InflaterInputStream.read(InflaterInputStream.java:158) ~[?:1.8.0_302]
        at com.google.common.io.ByteStreams$LimitedInputStream.read(ByteStreams.java:731) ~[graylog.jar:?]
        at com.google.common.io.ByteStreams.toByteArrayInternal(ByteStreams.java:181) ~[graylog.jar:?]
        at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:221) ~[graylog.jar:?]
        at org.graylog2.plugin.Tools.decompressZlib(Tools.java:217) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.gelf.GELFMessage.getJSON(GELFMessage.java:74) ~[graylog.jar:?]
        at org.graylog2.inputs.codecs.GelfCodec.decode(GelfCodec.java:125) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.processMessage(DecodingProcessor.java:153) ~[graylog.jar:?]
        at org.graylog2.shared.buffers.processors.DecodingProcessor.onEvent(DecodingProcessor.java:94) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:90) [graylog.jar:?]
        at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:47) [graylog.jar:?]
        at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_302]

That process buffer is what I was talking about… in my experience, it can still lock up a machine even if it shows just one that is not idle. Take that redacted message and wind back the process it goes through on your Graylog system (input, extractor, stream, pipeline, alert…) and you may find an area that is not being efficient… causing the buffer to have something in it. I have a very small system but from what I understand, these should all show idle. In the few that I have participated in, it seems to have pointed back to a regex or GROK command that was spending too much time on a message trying to find something… particularly in an overload situation or where incoming message format has changed.

I didn’t quite understand what I need to do. If it’s not difficult, could you describe where to click and what to look at? I have some trouble understanding what you are writing.

Take a look at the message that shows in the process buffer.

  1. What input did it come in on?
  2. Are there any extractors associated with that input? Do they use GROK/regex?
  3. Are there any streams the message would be assigned to based on stream rules that match the message?
  4. If the message is assigned to a stream or multiple streams, are there pipelines attached to that stream?
  5. If the message is traversing pipelines, what are the rules in those pipelines that the message executes after passing the when… then… section?
  6. Do any of the rules that are executed contain GROK or regex?

You need to understand the path that a message caught in the process buffer takes through Graylog before it’s stored in Elasticsearch. How is it processed?

This may not even be the right road to your solution but it is still a good thing to understand.

I have a lot of messages from different inputs in the buffer. Do I need to analyze all of them, or can I get by with one (provided that extractors are used on that input, etc.)?

I am not 100% convinced this is where we will solve the problem, so start small and look at one… more if you have time…

Here is a brief conclusion I have.
It may not be a fix but more or less a suggestion. I don’t want to tell you to start reconfiguring your environment, since it was, and seems to be, working fine. The problem, from what you stated, was an increase in memory usage, which doesn’t seem too bad right now.

Since you are ingesting a lot of logs, and I think you have quite a few fields generated from the messages/logs, I believe you may need more memory added. Just an easy, temporary solution.
I thought about decreasing the memory given to Elasticsearch, but that would do no good because you are ingesting 300-500 GB a day, about 3,000-5,000 messages per second. That would just cause more problems.

Some minor suggestions that could be applied.

  • If possible, try to decrease the amount of logs being ingested. This would depend on how you’re shipping the logs to the Graylog INPUT.

  • Are there a lot of saved searches? Try to decrease those if not needed; the same goes for widgets, dashboards, etc…

  • Try to fine-tune this environment, meaning: if you really don’t need it, remove it.

  • Try to increase the field type refresh interval to 30 seconds. You would need to edit your default index set and then manually recalculate and/or rotate the indices.
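For context, the field type refresh interval maps to Elasticsearch’s `index.refresh_interval` setting. You would normally change it in the Graylog UI under System > Indices > (index set) > Edit, but you can inspect the current value directly. This sketch assumes Elasticsearch is listening on localhost:9200 and that you use the default `graylog_` index prefix; adjust both for your environment:

```shell
# Show the current refresh interval on the Graylog-managed indices.
# (Read-only query; the setting itself should be changed via Graylog
# so the index set configuration stays in sync.)
curl -s 'http://localhost:9200/graylog_*/_settings/index.refresh_interval?pretty'
```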

  • The errors/warnings in the log are just stating that Graylog could not decode a RawMessage sent from a remote device. Tuning your log shippers (NXLog, Winlogbeat, Rsyslog, etc…) to send the proper type of data for the input you’re using might help. Example: if Windows machines use Winlogbeat as the log shipper, reconfigure them to send only the data you need and try not to send every event log from those machines. I noticed in the logs posted above that you are using GELF; this does create a lot of fields.

Not knowing exactly when this problem started or what took place before the issue was noticed, it’s hard to say.

From what you stated above, it’s only an increase in memory, BUT everything is working correctly? It might just be that you need more resources.

To sum it up

You can adjust a few different configurations to lower your memory usage, but from everything you have shared, everything seems to be running fine. Am I correct?

I don’t believe there is one single change that will lower your memory; I think it’s a combination of different configurations, and to be honest it probably doesn’t even need to happen.

If possible, add more memory to each system (e.g., 2 GB), then watch and trend the data to see if usage increases over time. If it does, then we might need to look further into fine-tuning your environment. If you do add more memory (2 GB), wait a couple of days or a week, and don’t add any new configuration or updates if possible. The more data we have, the better we can find a solution.

If you’re experiencing data loss, Graylog freezing/crashing, gaps in the graphs, etc…, then we’ll look further into this ASAP.

EDIT: This is a good read if you have a chance.

EDIT2: I just noticed this, @Uporaba Did you configure this on purpose?

GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"

Here is mine; maybe mine is just old.

GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:-OmitStackTraceInFastThrow "

Might take a look here; not sure what happened.

Hello. This was done automatically after installation

Everything you described above didn’t really help during the day. After restarting the virtual machine, I noticed that 19 GB of RAM filled up immediately (as indicated: 16 GB for Elasticsearch and 3 for Graylog), then RAM began to grow slowly. As a result, after an hour I have this picture:

Hmmm… I did it, and after about ~6 hours the RAM is around 64% (20 GB out of 32). Sometimes it gets smaller, sometimes bigger, but it does not increase to 30-31 GB as it did before. I will keep observing on Saturday and Sunday; maybe we have found a solution.


I think we found a clue :male_detective:

Hello. Over the weekend the indicators increased: the memory does not fill up immediately but gradually. In the beginning it was 64%, now 74%, and as you can see from the graph this happens on the two nodes where Graylog is located (Graylog is not on the third node). Accordingly, the problem is not with Elasticsearch but with Graylog; it remains to understand what the problem is. The point where I changed the settings and rebooted the nodes is marked in red. I also see that on the first node (it is the master in Graylog) the processor is loaded.


What I know is…

Setting the field type refresh interval to 30 seconds will reduce the load on resources.
Java tends to use a lot of memory depending on how many logs are being ingested/indexed, etc…
That many shards, and the types of searches being executed, will have an impact on memory.

Elasticsearch, Graylog, and MongoDB on the same node could be fighting over resources. If the amount of logs didn’t exceed 1,000-1,500 per second I would rule that out, but you’re receiving over 3,000 per second, so it makes me wonder.

So quick question.

  • All three nodes have Graylog, ES, and MongoDB.
  • Is es-node-03 the master node?
  • Are es-node-02/01 master/data nodes?

From the graph it looks like only two nodes have memory increasing; “elastic_data-3” is steady. If this is correct, something weird is happening, perhaps a missed configuration?
Can I ask, was es-node-03 always a master node before this issue?

EDIT: I just noticed Graylog is not on es-node-03, so it’s just Elasticsearch and MongoDB?
I’m kind of confused by what you stated.

When I stated this…

I was assuming that was right.

I’ve been going over this issue, doing some research on High memory Utilization on Graylog.
By chance, what do you see when you execute this on the nodes with high memory usage?

root# free -m

And is it possible to see this output? I’m curious; something doesn’t seem to add up.

root# lsblk

Out of curiosity, which one have you installed: Oracle Java or OpenJDK?

EDIT: I don’t think you mentioned this, but besides the memory usage, how is everything else working? Have any other problems come up?

EDIT2: Over-sharding, as we talked about before. 515 indices with 7,170 active shards.
Shards: 3
Replicas: 2

This is calculated from just one index set, not all the other indices you have.
That results in 2 replica shards per primary shard, giving you a total of 9 shards per index.
That would be 3 primary shards + 3 first replicas + 3 second replicas = 9 per index (4,635 across all 515 indices). As a rule of thumb, a node with 30 GB of heap memory should have at most about 600 shards.
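The arithmetic above can be sketched out like this (the numbers come from this thread; the ~600-shards-per-node figure is common guidance for a 30 GB heap, not a hard Elasticsearch limit):

```shell
# Shard math for this cluster, as described in the thread.
primaries=3
replicas=2
indices=515

per_index=$((primaries * (1 + replicas)))      # 3 + 3 + 3 = 9 shards per index
total=$((per_index * indices))                 # 9 * 515 = 4635 shards

per_node_guideline=600                         # rough guidance for a 30 GB heap node
cluster_guideline=$((per_node_guideline * 3))  # three nodes -> ~1800

echo "$per_index shards/index, $total total vs ~$cluster_guideline suggested ceiling"
```

Even by this rough guideline the cluster is carrying several times more shards than recommended, which is why over-sharding keeps coming up as a suspect.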
The size of each shard as shown in this document below.

Shard Size

Shard Count

To ensure you’re not going over the recommended shard size, you can execute this:

curl -X GET 'http://localhost:9200/_cat/indices?v'

Not sure if it’s the issue, but it does have an impact on resources, along with all the other things I’ve stated above.