Couldn't force merge index - any hint on where to troubleshoot

jtkarvo · September 5, 2017, 5:43am

Hi,

I had merging problems earlier, so I made the indices smaller by changing the retention settings. That led to problems with a large number of shards (I now have around 10000 in my ES cluster). So I decided to move back towards larger indices, based on advice found in an ES forum: size a shard to about 50G maximum. Now I have this index merge problem again. Any idea where to start digging? What parameters would be of interest?
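As a rough illustration of the shard sizing trade-off described above, here is a small Python sketch. The 50 GB ceiling is the rule of thumb mentioned in the post; the ingest volume and rotation period in the usage note are made-up numbers, and the function names are hypothetical.

```python
import math


def index_size_gb(daily_ingest_gb: float, rotation_hours: float) -> float:
    """Estimate the size of one index from ingest rate and rotation period."""
    return daily_ingest_gb * rotation_hours / 24.0


def shards_needed(size_gb: float, max_shard_gb: float = 50.0) -> int:
    """Smallest shard count that keeps every shard at or below max_shard_gb."""
    if size_gb <= 0:
        return 1
    return max(1, math.ceil(size_gb / max_shard_gb))
```

For example, assuming 400 GB of logs per day and 12h rotation, `shards_needed(index_size_gb(400, 12))` gives 4 primary shards per index. Fewer, larger shards keep the total shard count down, but each force merge then has more data to rewrite.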

There are no errors in the ES logs, but the following appears in the Graylog master node log, about 1 hour after index cycling:

2017-09-05T04:00:38.621+03:00 ERROR [SystemJobManager] Unhandled error while running SystemJob <3ff00ff0-91cd-11e7-a7a3-0050568617f3> [org.graylog2.indexer.indices.jobs.OptimizeIndexJob]
org.graylog2.indexer.ElasticsearchException: Couldn't force merge index graylog_1543
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:52) ~[graylog.jar:?]
        at org.graylog2.indexer.indices.Indices.optimizeIndex(Indices.java:629) ~[graylog.jar:?]
        at org.graylog2.indexer.indices.jobs.OptimizeIndexJob.execute(OptimizeIndexJob.java:71) ~[graylog.jar:?]
        at org.graylog2.system.jobs.SystemJobManager$1.run(SystemJobManager.java:89) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:235) [graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_141]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_141]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_141]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_141]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_141]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_141]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_141]
        at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[graylog.jar:?]
        at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[graylog.jar:?]
        at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282) ~[graylog.jar:?]
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[graylog.jar:?]
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[graylog.jar:?]
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[graylog.jar:?]
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[graylog.jar:?]
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[graylog.jar:?]
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[graylog.jar:?]
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[graylog.jar:?]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[graylog.jar:?]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[graylog.jar:?]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[graylog.jar:?]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) ~[graylog.jar:?]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[graylog.jar:?]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[graylog.jar:?]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.executeRequest(JestHttpClient.java:150) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:77) ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:47) ~[graylog.jar:?]
        ... 11 more

jochen · September 5, 2017, 6:50am

The Elasticsearch cluster takes longer to perform the force merge of the index segments than the configured request timeout.

You have multiple options to fix that:

- Reduce the index/shard sizes in Elasticsearch.
- Provide more resources (esp. IOPS) to your Elasticsearch nodes. Using SSDs is recommended.
- Disable index optimization in the affected index sets: http://docs.graylog.org/en/2.3/pages/configuration/index_model.html#index-set-configuration
- Increase the index optimization timeout:

https://github.com/Graylog2/graylog2-server/blob/2.3.1/misc/graylog.conf#L341-L343

# Global timeout for index optimization (force merge) requests.
# Default: 1h
#elasticsearch_index_optimization_timeout = 1h
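For illustration, the read timeout from the stack trace can be reproduced outside Graylog. The following is a minimal Python sketch, not Graylog's implementation: it sends a force merge request directly to Elasticsearch with a client-side read timeout. The base URL, the helper names `forcemerge_url` and `force_merge`, and `max_num_segments=1` are assumptions made for the example; only the index name comes from the log above.

```python
import socket
import urllib.parse
import urllib.request


def forcemerge_url(base_url: str, index: str, max_num_segments: int = 1) -> str:
    """Build the URL of the Elasticsearch force merge endpoint for one index."""
    query = urllib.parse.urlencode({"max_num_segments": max_num_segments})
    return f"{base_url.rstrip('/')}/{urllib.parse.quote(index, safe='')}/_forcemerge?{query}"


def force_merge(base_url: str, index: str, timeout_s: float = 3600.0) -> str:
    """POST the force merge and wait at most timeout_s seconds for the reply."""
    request = urllib.request.Request(
        forcemerge_url(base_url, index), data=b"", method="POST"
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return f"finished with HTTP {response.status}"
    except socket.timeout:
        # Same situation as the SocketTimeoutException in the Graylog log:
        # the client stopped waiting before Elasticsearch answered.
        return "client read timed out"


if __name__ == "__main__":
    # Assumed local node; adjust to your own cluster before running.
    print(force_merge("http://localhost:9200", "graylog_1543", timeout_s=3600.0))
```

If the call times out on the client side, Elasticsearch may well keep merging in the background, so a timeout here does not by itself mean that the merge failed.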

jtkarvo · September 5, 2017, 8:55am

Thanks! This is great. I did not notice that this option is now available - my old server.conf did not have that.

I’ll use a timeout of 11h for 12h cycling; I think there is no hurry in that optimization.
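The reasoning behind "11h for 12h cycling" is that the timeout should stay below the rotation period, so one optimization job is finished or abandoned before the next index is cycled. A small Python sketch of that check follows; the duration parser is a simplified assumption that only handles values such as `30m`, `1h` or `12h`, and is not Graylog's own parser.

```python
import re

# Seconds per unit for the simplified duration format assumed here.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(value: str) -> int:
    """Convert a duration such as '11h' or '30m' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([smhd])\s*", value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def timeout_fits_rotation(timeout: str, rotation: str) -> bool:
    """True if the optimization timeout is shorter than the rotation period."""
    return to_seconds(timeout) < to_seconds(rotation)
```

With the values from this post, `timeout_fits_rotation("11h", "12h")` is true.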

jochen · September 5, 2017, 10:04am

elasticsearch_index_optimization_timeout can be used to configure the request timeout for the force-merge request.

If a force-merge request takes 11 hours to complete, you have serious problems with the performance of your Elasticsearch cluster.

jan · September 5, 2017, 10:42am

To be honest:

"my old server.conf did not have that."

That is why one should read the update announcement and the update documentation, where such new or removed settings are explained.

jtkarvo · September 5, 2017, 11:02am

It will probably not take that long. We'll see that later. I just set it for now to see what happens. In practice, I don't see any performance change in the UI whether or not the optimization is running. ES seems to be doing it in a leisurely way in the background.

jtkarvo · September 5, 2017, 11:07am

Indeed. I have read the upgrade notes from the manual (e.g. http://docs.graylog.org/en/2.3/pages/upgrade/graylog-2.3.html). These documents do tell which settings to remove from the config. Is there another place worth looking at when upgrading, or did I just miss it?

By the way, my intention was not to criticize you with my comment on the old server.conf file.

jochen · September 5, 2017, 11:19am

The elasticsearch_index_optimization_timeout setting was already added in Graylog 2.2.0 and is only mentioned in the Graylog 2.2.0 changelog.

system · September 19, 2017, 11:19am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
