So we’ve built out a Graylog infrastructure that handles an average of 3000-4000 msg/s: 3 ES nodes, each with 8TB of SSD-backed storage, 64GB of RAM, and plenty of CPU (12 cores). Monitoring their resource usage indicates no problems; they have plenty of room to breathe.
The server node is similarly well-provisioned: 10 cores, 14GB of RAM, and 7/6 processor/output threads. It can sustain over 6000 msg/s.
The problem is that the system enters a state, without warning or obvious cause, where msg/s drops to 0 for several seconds, bursts to several thousand momentarily, then returns to 0. Overall throughput in this state is around 10000 msg/min according to the metrics, which is not enough to sustain our message input rate.
I can reliably return it to a working state by restarting all the nodes in my ES cluster. The only thing I’ve noticed that seems, maybe, to set it off is large, demanding searches.
I’ve been bashing my head against a wall trying to figure this out for some time, and I’m not sure where to look. There are no indications in the logs that I can find, or in any of the Graylog metrics. I do see lots of messages like this on the ES nodes:
[2018-11-15T15:55:52,096][INFO ][o.e.m.j.JvmGcMonitorService] [es-node-5] [gc][old][22092][8443] duration [5.3s], collections [1]/[6s], total [5.3s]/[38.5m], memory [15.5gb]->[15.1gb]/[15.9gb], all_pools {[young] [319.5mb]->[1.3mb]/[665.6mb]}{[survivor] [74mb]->[0b]/[83.1mb]}{[old] [15.1gb]->[15.1gb]/[15.1gb]}
[2018-11-15T15:55:52,096][WARN ][o.e.m.j.JvmGcMonitorService] [es-node-5] [gc][22092] overhead, spent [5.5s] collecting in the last [6s]
Try to monitor the ES Java heap usage.
Under http://IP:9200/_nodes/NODE/stats check heap_used_percent.
It should stay under 80% at all times (over 90-95% you risk OOM).
In your GC log that is about 15.9/16 = 99%, i.e. the heap is essentially full.
And the young GC collection count should be higher than the old one.
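For example, a quick check could look something like this (the host and node name are placeholders, and it assumes the stats API is reachable without auth):

import json
import urllib.request

ES_HOST = "http://IP:9200"      # placeholder: one of your ES nodes
NODE = "es-node-5"              # node name, ID, or IP

# Pull per-node stats from the nodes stats API.
with urllib.request.urlopen(f"{ES_HOST}/_nodes/{NODE}/stats") as resp:
    stats = json.load(resp)

for node in stats["nodes"].values():
    jvm = node["jvm"]
    heap_pct = jvm["mem"]["heap_used_percent"]
    young = jvm["gc"]["collectors"]["young"]["collection_count"]
    old = jvm["gc"]["collectors"]["old"]["collection_count"]
    print(f"{node['name']}: heap_used_percent={heap_pct}, young_gc={young}, old_gc={old}")
    # Rules of thumb from above: heap under ~80%, and young collections
    # should far outnumber old collections.
    if heap_pct > 80 or old >= young:
        print(f"  WARNING: {node['name']} looks heap-pressured")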
Try closing the old indices in Graylog; closed indices don’t use memory in ES, and then you can check the heap usage again.
How much data do you store in ES? (System/Indices Total: XX indices, XX documents, XXTB)
Based on my experience, ES needs about 1-2% of the stored data size as heap memory.
3 × 16 = 48 GB of heap, which should be OK for 4-5 TB of data (with 1 replica, 2-2.5 TB of unique data).
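As a rough back-of-the-envelope version of that (the heap size is taken from your GC log, and the 1-2% figure is just my rule of thumb):

# Rough sizing check based on the 1-2% rule of thumb above.
nodes = 3
heap_per_node_gb = 16                      # ~15.9 GB heap per node in your GC log
total_heap_gb = nodes * heap_per_node_gb   # 3 * 16 = 48 GB

# Invert the rule: how much stored data can that much heap serve?
data_at_2_percent_tb = total_heap_gb / 0.02 / 1024   # ~2.3 TB (conservative)
data_at_1_percent_tb = total_heap_gb / 0.01 / 1024   # ~4.7 TB (optimistic)

print(f"{total_heap_gb} GB of heap covers roughly "
      f"{data_at_2_percent_tb:.1f}-{data_at_1_percent_tb:.1f} TB of indexed data")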
Thanks for the info, macko003! The symptoms are certainly consistent with an OOM condition on the ES cluster. Indeed, upping the RAM on the nodes to 64GB (with 30GB going to the JVM heap) seems to have resolved the issue: heap-used percent has fallen from 96 to the mid 60s, and I have not entered the stalled state since. I thought I had given it plenty of RAM, but I guess not…
Follow-up: what are you using there to monitor heap-used percent?
// You can close Graylog indices, and they won’t use ES memory, but you also can’t search them.
We use our own custom monitoring system; in general it calls the ES API URL http://IP:9200/_nodes/NODE_ID_or_IP/stats.
There you can find a lot of useful metrics,
e.g. heap usage, search/index/fetch times, JVM stats, uptime, deleted items, etc.
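We don’t use exactly this, but a minimal sketch of that kind of poll could look like the following (host, node, and interval are placeholders, and it assumes the API is open without auth):

import json
import time
import urllib.request

ES_HOST = "http://IP:9200"    # placeholder
NODE = "NODE_ID_or_IP"        # placeholder, as in the URL above
INTERVAL_S = 60

while True:
    with urllib.request.urlopen(f"{ES_HOST}/_nodes/{NODE}/stats") as resp:
        stats = json.load(resp)

    for node in stats["nodes"].values():
        jvm = node["jvm"]
        idx = node["indices"]
        # A few of the metrics mentioned above, emitted as key=value pairs
        # that a monitoring agent could pick up.
        print(f"node={node['name']}",
              f"heap_used_percent={jvm['mem']['heap_used_percent']}",
              f"jvm_uptime_ms={jvm['uptime_in_millis']}",
              f"search_query_time_ms={idx['search']['query_time_in_millis']}",
              f"fetch_time_ms={idx['search']['fetch_time_in_millis']}",
              f"index_time_ms={idx['indexing']['index_time_in_millis']}",
              f"deleted_docs={idx['docs']['deleted']}")
    time.sleep(INTERVAL_S)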