We are continually seeing ‘trivial’ searches fail (i.e. not searches spanning months of data). Simply re-running the same search normally works, so it’s a transitory issue.
A sniffer shows Graylog sending ES the query and ES returning an HTTP 500 error: “Unable to perform search query.”
Looking at the ES logs, I can see:
[2017-11-02 08:05:49,628][DEBUG][action.search ] [kiwi] [graylog_6485], node[LcJGzDCvThmffdACkHwcmw], [R], v, s[STARTED], a[id=C2u1X6aiTIi8xXFVlIH4NQ]: Failed to execute [org.elasticsearch.action.search.SearchRequest@6d01236f] lastShard [true]
RemoteTransportException[[takahe][10.4.128.205:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@f9240b5 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@777d1fd1[Running, pool size = 49, active threads = 49, queued tasks = 1000, completed tasks = 89696]]];
So it looks like the search thread pool was full at that moment (active threads == pool size) and its queue was full as well (queued tasks == queue capacity). But I don’t know what that means.
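Reading the numbers out of that exception confirms the interpretation: all 49 search threads were busy and the 1000-slot queue was completely full, so the request had nowhere to go and was rejected. A small sketch pulling the figures out of the log line (the string below is just the relevant fragment of the message above):

```python
import re

# Relevant fragment of the EsRejectedExecutionException message from above.
log = ("EsThreadPoolExecutor[search, queue capacity = 1000, "
       "[Running, pool size = 49, active threads = 49, "
       "queued tasks = 1000, completed tasks = 89696]]")

m = re.search(r"queue capacity = (\d+).*"
              r"pool size = (\d+), active threads = (\d+), "
              r"queued tasks = (\d+)", log)
capacity, pool_size, active, queued = map(int, m.groups())

# The request is rejected only when both conditions hold:
print(active == pool_size)  # True -- every search worker is busy
print(queued == capacity)   # True -- the overflow queue is full
```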
Is this an indicator of a problem, or can I just increase whatever needs to be increased?
Thanks, this is GL-2.3.2 and ES-2.4.6
Your Elasticsearch cluster is operating at full capacity. While you could increase the relevant thread pools (and their queues), this would only mitigate the issue for a short time and lead to an ever-increasing backlog.
You could try tuning Elasticsearch for your needs, but the only good solution for this is to add more or better hardware (e.g. SSDs instead of spinning rust) to your Elasticsearch cluster.
For information about the Elasticsearch thread pools, please refer to Thread Pool | Elasticsearch Guide [2.4] | Elastic
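For completeness, in ES 2.x the search queue is a static node-level setting in elasticsearch.yml, so raising it (as a short-lived stop-gap only) would look roughly like this; the value is just an example, and a node restart is required:

```yaml
# elasticsearch.yml (ES 2.x) -- example value only; this buys headroom,
# it does not fix the underlying saturation
threadpool.search.queue_size: 2000
```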
Well that is weird. It’s a 4-node ES cluster of identical servers with 40 cores and 64G RAM each, and load averages are down in the 2-5 range. That appears grossly over-specced to me. Admittedly the disks are 15K spinning rust, but is this really an I/O problem? I’m seeing 20MB/s of writes via iotop, which doesn’t seem busy to me.
I don’t know. That’s something you have to investigate on your machines.
Are the disks local or are you using a SAN (which then might either be too slow over the network or you have a noisy neighbor problem)?
The hardware specs themselves don’t mean much. They might be fine for 200000 messages/second but break down at 1000000 messages/second.
Maybe you’re also just using badly tuned Elasticsearch nodes.
But all of this is nothing I can help you with. If you’re a Graylog Enterprise customer, contact support for help with pinpointing and possibly solving the performance problems.
What is the number of shards? I had these errors, too, but reducing the number of shards (making each shard bigger) helped.
We have 10G indexes with 4 shards each (and replicas=1, if it matters). Is that big or small?
They are small. See https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
So you could have 20-40G shards. Right now you have 2.5G shards, so you should make them about 10 times bigger to be efficient.
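Plugging the thread’s numbers in makes that concrete (the 25G target below is just an illustrative point inside the 20-40G range):

```python
# Shard-size arithmetic using the figures from this thread.
index_size_gb = 10
shards_per_index = 4

current_shard_gb = index_size_gb / shards_per_index
print(current_shard_gb)   # 2.5 -- far below the 20-40G guideline

# Aiming for ~25G shards, e.g. by rotating indices at 25G with a
# single shard, or keeping 4 shards and rotating at ~100G:
target_shard_gb = 25
growth_factor = target_shard_gb / current_shard_gb
print(growth_factor)      # 10.0 -- shards need to be ~10x bigger
```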
Currently each of your indices uses about 0.8G of ES JVM heap (or tries to: 4 primary shards + 4 replica shards, each taking roughly 0.1G of RAM). If you multiply that number by the total number of indices in your index set and divide by the number of ES nodes, you get the amount of JVM heap each node needs for ES to work efficiently.
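As a sketch of that calculation (the per-shard heap overhead is the ~0.1G figure quoted above, while the index count is a hypothetical placeholder):

```python
# Rough per-node JVM heap estimate following the rule of thumb above.
heap_per_shard_gb = 0.1   # figure quoted above; varies in practice
primaries = 4
replicas = 1
total_indices = 100       # hypothetical: indices retained in the index set
nodes = 4                 # the 4-node cluster from this thread

shard_copies_per_index = primaries * (1 + replicas)             # 8 copies
heap_per_index_gb = shard_copies_per_index * heap_per_shard_gb  # ~0.8G
heap_per_node_gb = heap_per_index_gb * total_indices / nodes
print(round(heap_per_node_gb, 1))  # 20.0 -- heap needed per node here
```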