We are continually seeing ‘trivial’ searches fail (i.e. not searches spanning months of data or anything like that). Simply re-running the same search normally works, so it seems to be a transient issue.
A sniffer shows Graylog sending ES the query and ES returning an HTTP 500 error: “Unable to perform search query.”
Looking at the ES logs I can see:
[2017-11-02 08:05:49,628][DEBUG][action.search ] [kiwi] [graylog_6485][3], node[LcJGzDCvThmffdACkHwcmw], [R], v[11], s[STARTED], a[id=C2u1X6aiTIi8xXFVlIH4NQ]: Failed to execute [org.elasticsearch.action.search.SearchRequest@6d01236f] lastShard [true]
RemoteTransportException[[takahe][10.4.128.205:9300][indices:data/read/search[phase/query]]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler@f9240b5 on EsThreadPoolExecutor[search, queue capacity = 1000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@777d1fd1[Running, pool size = 49, active threads = 49, queued tasks = 1000, completed tasks = 89696]]];
So it looks like the search thread pool was full at that moment (49 active threads on a pool of 49) and its queue was at capacity (1000 queued tasks against a capacity of 1000). But I don’t know what that actually means.
Is this an indicator of a problem, or can I just increase whatever needs to be increased?
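For reference, the same counters that appear in the rejection message can be watched live via the `_cat/thread_pool` API. A minimal sketch using Python’s `requests` below; the host/port is a placeholder and should point at any node in your cluster:

```python
# Minimal sketch: poll the counters that show up in the rejection above.
# Assumes the 'requests' package and an ES HTTP endpoint on localhost:9200
# (placeholder; point it at any node in your cluster).
# On ES 2.x look at the search.active / search.queue / search.rejected
# columns; newer versions also accept /_cat/thread_pool/search.
import time
import requests

ES_URL = "http://localhost:9200"  # placeholder host/port

while True:
    resp = requests.get(ES_URL + "/_cat/thread_pool", params={"v": "true"})
    resp.raise_for_status()
    print(resp.text)  # watch whether search.rejected keeps climbing
    time.sleep(10)
```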
Your Elasticsearch cluster is operating at full capacity. While you could increase the relevant thread pools (and their queues), this would only mitigate the issue for a short time and lead to an ever-increasing backlog.
You could try tuning Elasticsearch for your needs, but the only good solution for this is to add more or better hardware (e.g. SSDs instead of spinning rust) to your Elasticsearch cluster.
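If you do decide to bump the search queue as a stop-gap anyway, the relevant setting is the search queue size. A hedged sketch below, assuming an ES 2.x cluster where thread pool settings could, as far as I recall, still be changed dynamically; on 5.x and later this became a static node setting (`thread_pool.search.queue_size` in `elasticsearch.yml` plus a restart):

```python
# Stop-gap only, not a fix: raise the search queue size.
# Assumes ES 2.x, where thread pool settings were (as far as I recall)
# still dynamic cluster settings; on 5.x+ set thread_pool.search.queue_size
# in elasticsearch.yml instead and restart the node.
import requests

ES_URL = "http://localhost:9200"  # placeholder host/port

resp = requests.put(
    ES_URL + "/_cluster/settings",
    json={"transient": {"threadpool.search.queue_size": 2000}},
)
resp.raise_for_status()
print(resp.json())
```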
Well, that is weird. It’s a 4-node ES cluster of identical servers, each with 40 cores and 64 GB of RAM, and load averages are down in the 2-5 range. That appears grossly over-specced to me. Admittedly the disks are 15K spinning rust, but is this really an I/O problem? I’m seeing around 20 MB/s of writes via iotop, which doesn’t seem busy to me.
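(A rough way to check that beyond raw MB/s is to sample how busy the disks actually are, since a 15K disk can be saturated by random I/O at well under 20 MB/s. A minimal sketch with psutil below, assuming it is installed on the ES nodes; `iostat -x` shows the same utilisation numbers.)

```python
# Rough sketch: sample per-disk busy time instead of raw throughput.
# Assumes psutil is installed on the ES node; busy_time is Linux-only.
import time
import psutil

INTERVAL = 5  # seconds between samples

before = psutil.disk_io_counters(perdisk=True)
time.sleep(INTERVAL)
after = psutil.disk_io_counters(perdisk=True)

for disk, now in after.items():
    prev = before.get(disk)
    if prev is None:
        continue
    busy_ms = getattr(now, "busy_time", 0) - getattr(prev, "busy_time", 0)
    written_mb = (now.write_bytes - prev.write_bytes) / 1e6
    print(f"{disk}: ~{100 * busy_ms / (INTERVAL * 1000):.0f}% busy, "
          f"{written_mb / INTERVAL:.1f} MB/s written")
```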
I don’t know. That’s something you have to investigate on your machines.
Are the disks local, or are you using a SAN (which might either be too slow over the network or suffer from a noisy-neighbor problem)?
The hardware specs themselves don’t mean much. They might be fine for 200,000 messages/second but break down at 1,000,000 messages/second.
Maybe you’re also just using badly tuned Elasticsearch nodes.
But none of this is something I can help you with from here. If you’re a Graylog Enterprise customer, contact support to help you pinpoint and possibly solve the performance problems.
So you could have 20-40 G shards. Right now you have 2.5 G shards, so you should make them about 10 times bigger to be efficient.
Currently each of your indices uses (or tries to use) about 0.8 G of ES JVM heap: 4 shards plus 4 replica copies, each taking roughly 0.1 G. Multiply that number by the total number of indices in your index set and divide by the number of ES nodes, and you get the amount of JVM heap each node needs for ES to work efficiently.
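To make that arithmetic concrete, a small sketch below. The 4 shards + 4 replica copies layout and the rough 0.1 G per shard copy are the figures from this thread; the index count is only an example value to swap for your own retention settings:

```python
# Back-of-the-envelope version of the calculation above. The ~0.1 G of JVM
# heap per shard copy is the rough figure from this thread, not a measured
# value, and INDICES_IN_SET is just an example; use your own index set size.
HEAP_PER_SHARD_COPY_GB = 0.1
SHARDS_PER_INDEX = 4
REPLICA_COPIES = 4      # 4 primaries + 4 replica copies = 8 copies in total
INDICES_IN_SET = 30     # example; check your index set's rotation/retention
ES_NODES = 4

heap_per_index_gb = (SHARDS_PER_INDEX + REPLICA_COPIES) * HEAP_PER_SHARD_COPY_GB
heap_per_node_gb = heap_per_index_gb * INDICES_IN_SET / ES_NODES

print(f"~{heap_per_index_gb:.1f} G heap per index, "
      f"~{heap_per_node_gb:.1f} G heap needed on each ES node")
```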