So we’ve built out a Graylog infrastructure that handles an average of 3000-4000 msg/s: 3 ES nodes, each with 8TB of SSD-backed storage, 64GB of RAM, and plenty of CPU (12 cores). Monitoring their resource usage indicates no problems; they have plenty of room to breathe.
The server node is similarly well-provisioned: 10 cores, 14GB of RAM, and 7/6 processor/output threads. It can sustain over 6000 msg/s.
The problem is that the system enters a state, without warning or obvious cause, where msg/s drops to 0 for several seconds, bursts to several thousand momentarily, then returns to 0. Overall throughput in this state is around 10000 msg/min according to the metrics, which is not enough to sustain our message input rate.
I can reliably return it to a working state by restarting all the nodes in my ES cluster. The only thing I’ve noticed that seems, maybe, to set it off is large, demanding searches.
I’ve been bashing my head against a wall trying to figure this out for some time, and I’m not sure where to look. There are no indications in the logs that I can find, or in any of the Graylog metrics. I do see lots of messages like this on the ES nodes:
[2018-11-15T15:55:52,096][INFO ][o.e.m.j.JvmGcMonitorService] [es-node-5] [gc][old][22092][8443] duration [5.3s], collections [1]/[6s], total [5.3s]/[38.5m], memory [15.5gb]->[15.1gb]/[15.9gb], all_pools {[young] [319.5mb]->[1.3mb]/[665.6mb]}{[survivor] [74mb]->[0b]/[83.1mb]}{[old] [15.1gb]->[15.1gb]/[15.1gb]}
[2018-11-15T15:55:52,096][WARN ][o.e.m.j.JvmGcMonitorService] [es-node-5] [gc][22092] overhead, spent [5.5s] collecting in the last [6s]
Try to monitor the ES Java heap usage.
Under http://IP:9200/_nodes/NODE/stats check heap_used_percent.
It should stay under 80% at all times (over 90-95% you risk OOM).
In your GC log that is about 15.9/16 = 99%, i.e. the heap is essentially full.
And the young GC collection count should be higher than the old one.
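For example, a quick check could look something like this (the host and node name are placeholders, and it assumes the stats API is reachable without auth):

import json
import urllib.request

ES_HOST = "http://IP:9200"      # placeholder: one of your ES nodes
NODE = "es-node-5"              # node name, ID, or IP

# Pull per-node stats from the nodes stats API.
with urllib.request.urlopen(f"{ES_HOST}/_nodes/{NODE}/stats") as resp:
    stats = json.load(resp)

for node in stats["nodes"].values():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    young = jvm["gc"]["collectors"]["young"]["collection_count"]
    old = jvm["gc"]["collectors"]["old"]["collection_count"]
    print(f"{node['name']}: heap_used_percent={heap_pct}, young_gc={young}, old_gc={old}")
    # Rules of thumb from above: heap under ~80%, and young collections
    # should far outnumber old collections.
    if heap_pct > 80 or old >= young:
        print(f"  WARNING: {node['name']} looks heap-pressured")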
Try closing the old indices in Graylog; closed indices don’t use memory in ES, and then you can check the heap usage again.
How much data do you store in ES? (System/Indices Total: XX indices, XX documents, XXTB)
Based on my experience, ES needs about 1-2% of the stored data size as heap memory.
3 × 16 = 48 GB of heap, which should be OK for 4-5 TB of data (with 1 replica, 2-2.5 TB of unique data).
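As a rough back-of-the-envelope version of that (the heap size is taken from your GC log, and the 1-2% figure is just my rule of thumb):

# Rough sizing check based on the 1-2% rule of thumb above.
nodes = 3
heap_per_node_gb = 16                      # ~15.9 GB heap per node in your GC log
total_heap_gb = nodes * heap_per_node_gb   # 3 * 16 = 48 GB

# Invert the rule: how much stored data can that much heap serve?
data_at_2_percent_tb = total_heap_gb / 0.02 / 1024   # ~2.3 TB (conservative)
data_at_1_percent_tb = total_heap_gb / 0.01 / 1024   # ~4.7 TB (optimistic)

print(f"{total_heap_gb} GB of heap covers roughly "
      f"{data_at_2_percent_tb:.1f}-{data_at_1_percent_tb:.1f} TB of indexed data")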
Thanks for the info, macko003! The symptoms are certainly consistent with an OOM condition on the ES cluster. Indeed, upping the RAM on the nodes to 64GB (with 30GB going to the JVM heap) seems to have resolved the issue: heap-used percent has fallen from 96 to the mid 60s, and I have not entered the stalled state since. I thought I had given it plenty of RAM, but I guess not…
Follow-up: what are you using there to monitor heap-used percent?
// You can close Graylog indices, and they won’t use ES memory, but you also can’t search them.
We use our own custom monitoring system; in general it calls the ES API URL http://IP:9200/_nodes/NODE_ID_or_IP/stats.
There you can find a lot of useful metrics,
e.g. heap usage, search/index/fetch times, JVM stats, uptime, deleted items, etc.
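We don’t use exactly this, but a minimal sketch of that kind of poll could look like the following (host, node, and interval are placeholders, and it assumes the API is open without auth):

import json
import time
import urllib.request

ES_HOST = "http://IP:9200"    # placeholder
NODE = "NODE_ID_or_IP"        # placeholder, as in the URL above
INTERVAL_S = 60

while True:
    with urllib.request.urlopen(f"{ES_HOST}/_nodes/{NODE}/stats") as resp:
        stats = json.load(resp)

    for node in stats["nodes"].values():
        jvm = node["jvm"]
        idx = node["indices"]
        # A few of the metrics mentioned above, emitted as key=value pairs
        # that a monitoring agent could pick up.
        print(f"node={node['name']}",
              f"heap_used_percent={jvm['mem']['heap_used_percent']}",
              f"jvm_uptime_ms={jvm['uptime_in_millis']}",
              f"search_query_time_ms={idx['search']['query_time_in_millis']}",
              f"fetch_time_ms={idx['search']['fetch_time_in_millis']}",
              f"index_time_ms={idx['indexing']['index_time_in_millis']}",
              f"deleted_docs={idx['docs']['deleted']}")
    time.sleep(INTERVAL_S)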