Couldn't force merge index - any hint on where to troubleshoot

Hi,

I had merging problems earlier, so I made the indices smaller by changing the retention settings. That led to problems with a large number of shards (I now have around 10000 in my ES cluster). So I decided to go back to larger indices, based on advice found in an ES forum: size a shard to about 50 GB at most. Now I have the index merge problem again. Any idea where to start digging? Which parameters would be of interest?

There are no errors in the ES logs, but the following appears in the Graylog master node log, about 1 hour after index cycling:

2017-09-05T04:00:38.621+03:00 ERROR [SystemJobManager] Unhandled error while running SystemJob <3ff00ff0-91cd-11e7-a7a3-0050568617f3> [org.graylog2.indexer.indices.jobs.OptimizeIndexJob]
org.graylog2.indexer.ElasticsearchException: Couldn't force merge index graylog_1543
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:52) ~[graylog.jar:?]
        at org.graylog2.indexer.indices.Indices.optimizeIndex(Indices.java:629) ~[graylog.jar:?]
        at org.graylog2.indexer.indices.jobs.OptimizeIndexJob.execute(OptimizeIndexJob.java:71) ~[graylog.jar:?]
        at org.graylog2.system.jobs.SystemJobManager$1.run(SystemJobManager.java:89) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:235) [graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_141]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_141]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_141]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_141]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_141]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_141]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_141]
        at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[graylog.jar:?]
        at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[graylog.jar:?]
        at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282) ~[graylog.jar:?]
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[graylog.jar:?]
        at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[graylog.jar:?]
        at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[graylog.jar:?]
        at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[graylog.jar:?]
        at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[graylog.jar:?]
        at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[graylog.jar:?]
        at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[graylog.jar:?]
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[graylog.jar:?]
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[graylog.jar:?]
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89) ~[graylog.jar:?]
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) ~[graylog.jar:?]
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[graylog.jar:?]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[graylog.jar:?]
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.executeRequest(JestHttpClient.java:150) ~[graylog.jar:?]
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:77) ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:47) ~[graylog.jar:?]
        ... 11 more

The Elasticsearch cluster takes longer to perform the force merge of the index segments than the configured request timeout.

You have multiple options for fixing that:

Thanks! This is great. I did not notice that this option is now available - my old server.conf did not have that.

I’ll use a timeout of 11h for 12h cycling; I think there is no hurry in that optimization.

elasticsearch_index_optimization_timeout can be used to configure the request timeout for the force-merge request.
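As a minimal sketch, the 11-hour timeout mentioned above could be set in the Graylog server.conf roughly like this. The duration syntax (a number followed by a unit such as h) is an assumption based on how other Graylog duration settings are written, and the value is only the poster's choice, not a recommendation:

```
# Request timeout for the force-merge (index optimization) request.
# Assumed syntax: <number><unit>, e.g. 1h. The 11h value comes from this thread.
elasticsearch_index_optimization_timeout = 11h
```

Graylog reads server.conf at startup, so the node most likely needs a restart before the new timeout applies.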

If a force-merge request takes 11 hours to complete, you have serious problems with the performance of your Elasticsearch cluster.

To be honest, regarding "my old server.conf did not have that": that is why one should read the update announcement and the update documentation, where such new or removed settings are explained.

It will probably not take that long. We’ll see that later. I just used that now to see what happens. The reality is that I don’t see any performance change in UI, whether the optimization is going on, or not. ES seems to be doing it in a leisurely way in the background.
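One way to find out how long a merge really takes is to trigger it by hand and time it. The following is a rough sketch, not what Graylog does internally: it assumes Elasticsearch is reachable over HTTP without authentication, uses the Elasticsearch `_forcemerge` endpoint with `max_num_segments=1`, and the helper names `forcemerge_url` and `timed_forcemerge` are made up for illustration.

```python
import time
import urllib.parse
import urllib.request


def forcemerge_url(base_url, index, max_num_segments=1):
    """Build the force-merge endpoint URL for a single index."""
    query = urllib.parse.urlencode({"max_num_segments": max_num_segments})
    return "{}/{}/_forcemerge?{}".format(
        base_url.rstrip("/"), urllib.parse.quote(index, safe=""), query
    )


def timed_forcemerge(base_url, index, timeout_seconds, max_num_segments=1):
    """POST the force-merge request; return (HTTP status, elapsed seconds).

    Raises socket.timeout or urllib.error.URLError if Elasticsearch does not
    answer within timeout_seconds, which mirrors the "Read timed out" error
    in the Graylog log.
    """
    request = urllib.request.Request(
        forcemerge_url(base_url, index, max_num_segments),
        data=b"",        # empty body; the parameters are in the query string
        method="POST",
    )
    started = time.monotonic()
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        status = response.status
    return status, time.monotonic() - started
```

For example, `timed_forcemerge("http://localhost:9200", "graylog_1543", timeout_seconds=4 * 3600)` would wait up to four hours; the host and the timeout are placeholders. The measured duration gives a more realistic basis for choosing the timeout than a guess.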

Indeed. I have read the upgrade notes from the manual (e.g. http://docs.graylog.org/en/2.3/pages/upgrade/graylog-2.3.html). These documents do tell which settings to remove from the config. Is there another place worth checking when upgrading, or did I just miss it?

Btw, my intention was not to criticize you with my comment on the old server.conf file.

The elasticsearch_index_optimization_timeout setting was already added in Graylog 2.2.0 and is only mentioned in the Graylog 2.2.0 changelog.
