ES high GC overhead issue


(Charles Deng) #1

I am using a 30GB Java heap for ES, but it seems that the GC overhead is high:

[2018-04-11T10:13:51,438][INFO ][o.e.m.j.JvmGcMonitorService] [es3.mylogs.com] [gc][323] overhead, spent [318ms] collecting in the last [1s]
[2018-04-11T10:19:45,122][INFO ][o.e.m.j.JvmGcMonitorService] [es3.mylogs.com] [gc][676] overhead, spent [330ms] collecting in the last [1s]
[2018-04-11T10:24:23,237][INFO ][o.e.m.j.JvmGcMonitorService] [es3.mylogs.com] [gc][953] overhead, spent [381ms] collecting in the last [1s]
[2018-04-11T10:39:55,239][INFO ][o.e.m.j.JvmGcMonitorService] [es3.mylogs.com] [gc][1881] overhead, spent [313ms] collecting in the last [1s]

Is there any way to tune it?


#2

Notice that garbage collection happens quite rarely, so spending 300ms collecting every five minutes is not very bad. If it bothers you, you can try reducing the Java heap size so that collection happens more often, but I would not do that blindly. You can install X-Pack on your ES cluster and monitor the Java heap usage of your nodes (the monitoring portion is free). Then you'll see more precisely what is happening there.
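Even without X-Pack, you can spot-check heap usage with the cat nodes API; the hostname below is just an example taken from this thread, so point it at any node in your cluster:

```
# Show heap and RAM usage per node (hostname is an example)
curl -s 'http://es1.mylogs.com:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent'
```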


(Charles Deng) #3

In normal operation there should be no such INFO events. And when I start 3 GELF collector agents to reach an input rate of around 15K messages per second, the GC interval becomes shorter and the GC overhead becomes bigger, with WARN output:

[2018-04-11T15:43:39,512][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15146] overhead, spent [1s] collecting in the last [1.8s]
[2018-04-11T15:43:47,887][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][young][15152][13246] duration [2.1s], collections [1]/[2.8s], total [2.1s]/[35m], memory [13.4gb]->[12.2gb]/[29.8gb], all_pools {[young] [1.2gb]->[1.3mb]/[1.3gb]}{[survivor] [153.9mb]->[166.3mb]/[166.3mb]}{[old] [12gb]->[12gb]/[28.3gb]}
[2018-04-11T15:43:47,887][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15152] overhead, spent [2.1s] collecting in the last [2.8s]
[2018-04-11T15:43:50,982][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15155] overhead, spent [716ms] collecting in the last [1s]
[2018-04-11T15:44:38,226][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15202] overhead, spent [518ms] collecting in the last [1s]
[2018-04-11T15:44:44,019][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][young][15207][13282] duration [1s], collections [1]/[1.7s], total [1s]/[35.1m], memory [15.6gb]->[14.9gb]/[29.8gb], all_pools {[young] [944.4mb]->[12.1mb]/[1.3gb]}{[survivor] [166.3mb]->[163.4mb]/[166.3mb]}{[old] [14.6gb]->[14.7gb]/[28.3gb]}
[2018-04-11T15:44:44,019][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15207] overhead, spent [1s] collecting in the last [1.7s]
[2018-04-11T15:44:51,914][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][young][15210][13283] duration [5.4s], collections [1]/[5.7s], total [5.4s]/[35.2m], memory [16gb]->[14.9gb]/[29.8gb], all_pools {[young] [1.1gb]->[8.6mb]/[1.3gb]}{[survivor] [163.4mb]->[166.3mb]/[166.3mb]}{[old] [14.7gb]->[14.8gb]/[28.3gb]}
[2018-04-11T15:44:51,915][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15210] overhead, spent [5.4s] collecting in the last [5.7s]
[2018-04-11T15:46:00,120][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15277] overhead, spent [528ms] collecting in the last [1s]
[2018-04-11T15:46:03,577][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15280] overhead, spent [660ms] collecting in the last [1.1s]
[2018-04-11T15:46:36,193][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15311] overhead, spent [500ms] collecting in the last [1s]
[2018-04-11T15:46:38,194][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15313] overhead, spent [543ms] collecting in the last [1s]
[2018-04-11T15:46:52,503][WARN ][o.e.m.j.JvmGcMonitorService] [es2.mylogs.com] [gc][15326] overhead, spent [601ms] collecting in the last [1.1s]

At the same time, I observed that the message journal becomes very large after several hours of input running, along with high CPU usage. I suspect that instead of shrinking the JVM heap size, I should add more CPU cores to the ES nodes, although each ES node already has 28 cores.


(Charles Deng) #4

Also, in the log of the Graylog node, there are the following messages:

2018-04-11T12:43:21.465+08:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-04-11T12:47:37.919+08:00 INFO  [connection] Opened connection [connectionId{localValue:22, serverValue:3375}] to mg1.mylogs.com:27017
2018-04-11T12:47:37.925+08:00 INFO  [connection] Opened connection [connectionId{localValue:21, serverValue:3374}] to mg1.mylogs.com:27017
2018-04-11T12:47:37.929+08:00 INFO  [connection] Opened connection [connectionId{localValue:23, serverValue:3376}] to mg1.mylogs.com:27017
2018-04-11T13:45:12.901+08:00 WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2018-04-11T14:00:14.448+08:00 INFO  [connection] Opened connection [connectionId{localValue:24, serverValue:3556}] to mg1.mylogs.com:27017
2018-04-11T14:13:14.728+08:00 INFO  [connection] Opened connection [connectionId{localValue:25, serverValue:3589}] to mg1.mylogs.com:27017

What is the exact meaning of the above WARN messages? Does this mean the replication sync within the MongoDB cluster is not quick enough?


#5

You are probably right; this looks like it is not about the JVM. X-Pack might shed some light on it, though.

I have sometimes had symptoms similar to those in your second post (re-registering), so I am interested in the replies to this, too.


(Charles Deng) #6

Indeed, I also found the following log messages on the Graylog node; they show that communication within the cluster has problems with this input:

2018-04-11T19:05:59.487+08:00 WARN  [ProxiedResource] Unable to call https://gl3.mylogs.com:9000/api/system/metrics/multiple on node <4c41017e-d0c8-42a8-989e-f61be79cafd5>
2018-04-11T19:49:05.490+08:00 WARN  [ProxiedResource] Unable to call https://gl2.mylogs.com:9000/api/system/metrics/multiple on node <ace5f034-7db7-45b2-8409-e8559ca3b8e7>
2018-04-11T19:49:07.170+08:00 WARN  [ProxiedResource] Unable to call https://gl2.mylogs.com:9000/api/system on node <ace5f034-7db7-45b2-8409-e8559ca3b8e7>

#7

Look at this: https://github.com/Graylog2/graylog2-server/issues/2404
There is a timeout parameter for this: stale_master_timeout, whose default is 2s.
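For reference, that parameter lives in the Graylog server configuration file and is given in milliseconds; raising it to 10 seconds would look like this:

```
# graylog server.conf – default is 2000 (2s)
stale_master_timeout = 10000
```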


(Charles Deng) #8

Yes, I also found this. But it seems increasing the timer only keeps the node from being removed from the cluster; the long delays in responses should still lead to a cumulative backlog of requests within the cluster.

I use nginx to load-balance logs from the GELF agents across the Graylog nodes. When I start the input, the logs are evenly distributed across all 3 Graylog nodes, but when I checked after several hours, I found that nginx had stopped sending logs to one of the 3 nodes and all input was going to the remaining 2.

I also noted that the ES nodes emit WARN logs at very high frequency (almost one per millisecond):

[2018-04-11T17:06:31,639][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,647][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,647][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,648][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,648][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,649][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,649][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,649][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,650][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,650][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,650][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,651][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,651][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,651][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,652][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,652][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]
[2018-04-11T17:06:31,653][WARN ][o.e.d.i.m.TypeParsers    ] Expected a boolean [true/false] for property [index] but got [not_analyzed]

This should be the Graylog default template's dynamic template using the legacy setting "not_analyzed" instead of "false":

"dynamic_templates" : [
          { "internal_fields" : { "match" : "gl2_*", "mapping" : { "type" : "keyword" } } },
          { "store_generic"   : { "match" : "*",     "mapping" : { "index" : "not_analyzed" } } }
        ]
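In ES 5.x the 2.x-era string setting "index": "not_analyzed" roughly translates to the keyword field type, so a corrected sketch of that dynamic template could look like the following. This is only an illustration, not necessarily the exact template newer Graylog versions ship:

```
"dynamic_templates" : [
  { "internal_fields" : { "match" : "gl2_*", "mapping" : { "type" : "keyword" } } },
  { "store_generic"   : { "match_mapping_type" : "string", "mapping" : { "type" : "keyword" } } }
]
```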

(Charles Deng) #9

I have observed the ES GC problem briefly pausing Graylog's output to it, even under low load (1/3 of the target load). During that time, one ES node had high CPU usage while the other two ES nodes had low CPU usage.

Although I have set stale_master_timeout to 10000 ms, the system still reports cluster sync issues:

on es1:

[2018-04-12T23:20:10,404][WARN ][o.e.d.z.PublishClusterStateAction] [es1.mylogs.com] received a cluster state from a different master than the current one, rejecting (received {es3.mylogs.com}{PM-3Q9lUSn6mhvE9wWairQ}{TLpY_hzJSHmj43IWAE2T9g}{es3.mylogs.com}{192.168.1.23:9300}, current {es1.mylogs.com}{2mQ1AOofQxyxpkQW5zPK9g}{QAa0eZSwScaba_rkCg3fIg}{es1.mylogs.com}{192.168.1.21:9300})
[2018-04-12T23:20:14,020][WARN ][o.e.d.z.ZenDiscovery     ] [es1.mylogs.com] discovered [{es3.mylogs.com}{PM-3Q9lUSn6mhvE9wWairQ}{TLpY_hzJSHmj43IWAE2T9g}{es3.mylogs.com}{192.168.1.23:9300}] which is also master but with an older cluster_state, telling [{es3.mylogs.com}{PM-3Q9lUSn6mhvE9wWairQ}{TLpY_hzJSHmj43IWAE2T9g}{es3.mylogs.com}{192.168.1.23:9300}] to rejoin the cluster ([node fd ping])

on es2:

[2018-04-12T23:29:36,307][WARN ][o.e.t.TransportService   ] [es2.mylogs.com] Received response for a request that has timed out, sent [42086ms] ago, timed out [12086ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es1.mylogs.com}{2mQ1AOofQxyxpkQW5zPK9g}{QAa0eZSwScaba_rkCg3fIg}{es1.mylogs.com}{192.168.1.21:9300}], id [216911]

on es3:

[2018-04-12T23:29:35,902][WARN ][o.e.t.TransportService   ] [es3.mylogs.com] Received response for a request that has timed out, sent [41662ms] ago, timed out [11662ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es1.mylogs.com}{2mQ1AOofQxyxpkQW5zPK9g}{QAa0eZSwScaba_rkCg3fIg}{es1.mylogs.com}{192.168.1.21:9300}], id [218597]
[2018-04-12T23:30:15,266][WARN ][o.e.t.TransportService   ] [es3.mylogs.com] Received response for a request that has timed out, sent [36318ms] ago, timed out [6317ms] ago, action [internal:discovery/zen/fd/master_ping], node [{es1.mylogs.com}{2mQ1AOofQxyxpkQW5zPK9g}{QAa0eZSwScaba_rkCg3fIg}{es1.mylogs.com}{192.168.1.21:9300}], id [218668]

Currently I am using OpenJDK. Does the Oracle JVM handle GC better than OpenJDK?
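An alternative to switching JVMs would be making master fault detection more tolerant of long GC pauses. The zen discovery fault-detection settings below exist in ES 5.x, but the values are purely illustrative, so treat this as an experiment rather than a recommendation:

```
# elasticsearch.yml – defaults are ping_interval 1s, ping_timeout 30s, ping_retries 3
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5
```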


(Charles Deng) #10

This kind of information is output very frequently under a pressure test, say 1-3 events per ms. I suspect this log output affects the performance of ES. Is there any way to turn it off? Or how can we correct the definition from "not_analyzed" to "false"? I have tried to modify the template, but it leads to other errors.
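If the goal is just to silence that WARN, Elasticsearch log levels can be changed at runtime via the cluster settings API. The logger name below is an assumption, inferred from the abbreviated class name `o.e.d.i.m.TypeParsers` in the log lines (which suggests a deprecation logger), so verify it against your log4j configuration before relying on it:

```
# Sketch: raise the (inferred) deprecation logger to ERROR so the WARNs stop
curl -XPUT 'http://es1.mylogs.com:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{ "transient": { "logger.org.elasticsearch.deprecation.index.mapper": "ERROR" } }'
```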


(Jan Doberstein) #11

You could create a custom mapping in Elasticsearch that changes the field index to string or whatever.

The other option would be to send only booleans as the content of the field index.


(Charles Deng) #12

I have not quite caught your idea. Isn't this "index" in the default index template an option that controls whether field values are indexed, per https://www.elastic.co/guide/en/elasticsearch/reference/5.6/mapping-index.html ?

As I understand it, the following piece

"store_generic"   : { "match" : "*",     "mapping" : { "index" : "not_analyzed" } }

keeps all other dynamic fields from being indexed?


(Jan Doberstein) #13

It looks like you have a field that is called index … at least from the error message.

Just create a custom mapping.
http://docs.graylog.org/en/2.4/pages/configuration/elasticsearch.html#custom-index-mappings
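Following that page, a custom mapping is just an extra index template loaded alongside Graylog's default one. Everything below (the template name, field name, and type) is an example of the pattern from the docs, not something Graylog ships:

```
# Sketch of a custom index mapping template per the linked docs (names are examples)
curl -XPUT 'http://es1.mylogs.com:9200/_template/graylog-custom-mapping' -H 'Content-Type: application/json' -d '
{
  "template": "graylog_*",
  "mappings": {
    "message": {
      "properties": { "http_response_code": { "type": "long" } }
    }
  }
}'
```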


(Charles Deng) #14

No, I have no such field named "index". I guess it is due to the one in the default index template…

I have double-checked the output of the following command:

 curl -XGET -u graylog 'https://es1.mylogs.com:9200/_template/?pretty'

and I am sure there is no dynamic field named "index" sent by my GELF agent.


(system) #15

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.