Hey!
I’ve had a lot of problems with my Graylog/Elasticsearch cluster lately. I’ve managed to more or less stabilize it by applying various best practices that hadn’t been set up when the cluster was handed over to me.
Now that the cluster is more or less stable, two problems keep showing up in my logs:
The first is in the Graylog web interface under Indexer failures:
RemoteTransportException[[es-node1][<ipaddress:port>][indices:data/write/bulk[s]]]; nested: RemoteTransportException[[es-node1][<ipaddress:port>][indices:data/write/bulk[s][p]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$4@32d2251 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@59ac05b8[Running, pool size = 32, active threads = 32, queued tasks = 50, completed tasks = 11182813]]];
It seems to me that Elasticsearch isn’t able to process incoming messages fast enough, but what exactly does this mean? Unfortunately, the documentation isn’t helping me much here. I started polling this metric every minute, and during active hours I average about 15K failures/minute, which does not sound good… Am I losing log messages?
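Would watching the bulk rejections directly on the Elasticsearch side be the right way to confirm this? Something like the following is what I had in mind (a rough Java sketch against the _cat/thread_pool API; es-node1:9200 is just a placeholder for one of my data nodes):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class BulkRejectionPoller {
    public static void main(String[] args) throws Exception {
        // Ask one ES node for the cluster-wide thread pool overview;
        // the bulk.rejected column is the one I care about.
        URL url = new URL("http://es-node1:9200/_cat/thread_pool?v");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line); // one row per node: active, queued and rejected tasks per pool
            }
        }
    }
}

If the bulk.rejected counter keeps climbing, am I right to assume those are the same failures I see in the web interface?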
The second is in the graylog-server log:
2017-06-20T08:44:21.643+03:00 ERROR [BlockingBatchedESOutput] Unable to flush message buffer
java.lang.ClassCastException: Cannot cast java.lang.String to org.joda.time.DateTime
at java.lang.Class.cast(Class.java:3369) ~[?:1.8.0_121]
at org.graylog2.plugin.Message.getFieldAs(Message.java:379) ~[graylog.jar:?]
at org.graylog2.plugin.Message.getTimestamp(Message.java:187) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages.propagateFailure(Messages.java:160) ~[graylog.jar:?]
at org.graylog2.indexer.messages.Messages.bulkIndex(Messages.java:126) ~[graylog.jar:?]
at org.graylog2.outputs.ElasticSearchOutput.writeMessageEntries(ElasticSearchOutput.java:105) ~[graylog.jar:?]
at org.graylog2.outputs.BlockingBatchedESOutput.flush(BlockingBatchedESOutput.java:137) [graylog.jar:?]
at org.graylog2.outputs.BlockingBatchedESOutput.writeMessageEntry(BlockingBatchedESOutput.java:114) [graylog.jar:?]
at org.graylog2.outputs.BlockingBatchedESOutput.write(BlockingBatchedESOutput.java:96) [graylog.jar:?]
at org.graylog2.buffers.processors.OutputBufferProcessor$1.run(OutputBufferProcessor.java:194) [graylog.jar:?]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) [graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_121]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
It feels like Graylog is expecting a timestamp at some point but is getting something else. Could this be causing the indexer failures in the frontend, and how could I find out exactly which log messages trigger this error?
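If I read the stack trace correctly, the timestamp field of some messages must be arriving as a plain string instead of a DateTime, so the cast inside Message.getFieldAs() fails. This contrived snippet reproduces the same exception (just an illustration with joda-time on the classpath, not actual Graylog code):

import org.joda.time.DateTime;

public class TimestampCastExample {
    public static void main(String[] args) {
        // Pretend an input or extractor stored the message timestamp as a raw string.
        Object timestampField = "2017-06-20 08:44:21";

        // This mirrors what Message.getTimestamp() appears to do via getFieldAs(DateTime.class, ...)
        // and throws: java.lang.ClassCastException: Cannot cast java.lang.String to org.joda.time.DateTime
        DateTime timestamp = DateTime.class.cast(timestampField);
        System.out.println(timestamp);
    }
}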
Graylog version: 2.2.3
Elasticsearch version: 2.4.4
Any feedback would be appreciated.
Thank you!