Index error and journal filling up on Graylog OVA

Hi there,
Since a flood of incoming messages hit the server, message processing has stopped.

The index error shows:

RemoteTransportException[[Cloud 9][192.168.251.20:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[Cloud 9][192.168.251.20:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@6d9f169a on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12e2511e[Running, pool size = 8, active threads = 8, queued tasks = 50, completed tasks = 486408]]];

The Elasticsearch cluster is green.

Does anyone know how to get message processing going again?

Elasticsearch will start processing messages again as soon as it has completed the queued tasks and has capacity to process new tasks.
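To see from the error text itself why requests are being rejected, here is a minimal Python sketch that pulls the thread pool figures out of an `EsRejectedExecutionException` message. The function name `parse_rejection` and the `saturated` flag are made up for illustration; the sample string is a shortened copy of the error posted above.

```python
import re

# Matches the executor details inside an EsRejectedExecutionException message.
_POOL_RE = re.compile(
    r"EsThreadPoolExecutor\[(?P<name>\w+), queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<pool_size>\d+), active threads = (?P<active>\d+),"
    r" queued tasks = (?P<queued>\d+), completed tasks = (?P<completed>\d+)"
)


def parse_rejection(message):
    """Return the thread pool figures from a rejection message, or None."""
    match = _POOL_RE.search(message)
    if match is None:
        return None
    stats = {k: int(v) for k, v in match.groupdict().items() if k != "name"}
    stats["name"] = match.group("name")
    # Hypothetical helper flag: every thread is busy and the queue is full.
    stats["saturated"] = (
        stats["active"] >= stats["pool_size"]
        and stats["queued"] >= stats["capacity"]
    )
    return stats


# Shortened copy of the error message from this thread.
sample = (
    "EsRejectedExecutionException[rejected execution of "
    "org.elasticsearch.transport.TransportService$4@6d9f169a on "
    "EsThreadPoolExecutor[bulk, queue capacity = 50, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12e2511e"
    "[Running, pool size = 8, active threads = 8, queued tasks = 50, "
    "completed tasks = 486408]]]"
)

print(parse_rejection(sample))
```

In the posted message all 8 bulk threads are active and the queue holds 50 tasks against a capacity of 50, so new bulk requests are rejected until some of the queued tasks complete.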

Hello,

I get the same indexing failures in Graylog.
However, in my case they only happen when the indices are being cycled, and they seem to clear up eventually.

RemoteTransportException[[Taurus][10.36.20.151:9300][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[Taurus][10.36.20.151:9300][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@2677a163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@239894fc[Running, pool size = 8, active threads = 8, queued tasks = 54, completed tasks = 568556]]];

I have a few concerns about these failures, however:

  1. Are we actually losing data/logs when we receive these indexing failure messages, or will it retry until it’s successful?
  2. I observed this issue after changing the following settings from their default values:
    output_batch_size = 3000
    processbuffer_processors = 8
    outputbuffer_processors = 5
    Would further increasing outputbuffer_processors to 8 perhaps help mitigate the failures during index cycling?

Thanks

Check that the Elasticsearch cluster has enough nodes and that each node has enough memory.

You have to find out why tasks in Elasticsearch are piling up and fix this issue.

Just to explain:

You connect to Elasticsearch with 5 processors per Graylog server every second, each pushing up to 3000 messages into the cluster. During the index cycle, the Elasticsearch cluster is not able to keep up with that pace.
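The arithmetic behind this can be sketched in a few lines of Python. This is only a rough upper bound under stated assumptions: a single Graylog server, and every output buffer processor sending a full batch in the same cycle. The function name `peak_messages_per_cycle` is made up for illustration.

```python
def peak_messages_per_cycle(outputbuffer_processors, output_batch_size, graylog_servers=1):
    """Upper bound on messages pushed to Elasticsearch in one cycle.

    Assumes every processor on every Graylog server sends a full batch
    at the same time, which is the worst case, not the typical one.
    """
    return outputbuffer_processors * output_batch_size * graylog_servers


# Settings from this thread: 5 processors, batches of up to 3000 messages.
current = peak_messages_per_cycle(5, 3000)

# The proposed change to 8 processors with the same batch size.
proposed = peak_messages_per_cycle(8, 3000)

print(current, proposed)
```

With the posted settings the worst case is 15000 messages per cycle per Graylog server, and 8 processors would raise that bound to 24000.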

Raising the number of output buffer processors would make it even worse, because more workers would connect to Elasticsearch. Did you modify the refresh_interval in Elasticsearch?

Hello Jan,

Sorry for getting back to you quite late.
Are you referring to index.refresh_interval in Elasticsearch? We have not modified it.

Do you have any recommendations on tuning this setting?

Thanks a lot!

Update:

I have reduced outputbuffer_processors from 5 to 3.
Today's index cycle produced no data/write/bulk “Indexing Failures” :smiley:

I’ll observe for a couple more days.