We are seeing some search performance issues, and before investing in new hardware I would like to make sure we are heading in the right direction, so I am asking the community for their experience.
We’re running an infrastructure with 3 Graylog and 3 Elasticsearch nodes, of which one Elasticsearch node is a dedicated master-eligible node that holds no data. So there are only 2 data nodes in the Elasticsearch cluster. Unfortunately those data nodes, or at least one of them, are not equipped with very fast disks.
Input is around 4k messages per second. The sustained maximum output is around 10k messages per second. We have one index per day each for audit and other messages, each configured with one primary and one replica shard. At the moment there are 225 indices, 450 shards and 17,977,430,475 documents in the Elasticsearch cluster, consuming 15.44 TB of disk space.
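For context, here is a rough back-of-the-envelope calculation from the figures above (15.44 TB on disk, 450 shards, ~18 billion documents); the decimal-terabyte interpretation is my assumption:

```python
# Rough per-shard averages for the cluster described above.
total_bytes = 15.44e12          # 15.44 TB, assuming decimal terabytes
total_shards = 450
total_docs = 17_977_430_475

avg_shard_gb = total_bytes / total_shards / 1e9
avg_docs_per_shard = total_docs // total_shards

print(f"average shard size: {avg_shard_gb:.1f} GB")       # ~34 GB
print(f"average docs per shard: {avg_docs_per_shard:,}")  # ~40 million
```

That average shard size is within the commonly cited comfortable range for Elasticsearch, so the shard count itself may matter more here than individual shard size.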
A search over the last 5 minutes takes about 20 seconds across all messages, and roughly 3 seconds less within streams. A search over the last day (24 hours) takes about 60 seconds, and during the search, after 20 seconds or so, the output stops sending messages to Elasticsearch. After the search has finished, messages are sent again and the journal is drained.
During a search (no matter how far back in time), the Elasticsearch node holding the primary shards runs at over 90% CPU. The system’s load average is around 8 and peaks at 12 during a search.
From my understanding of the Elasticsearch architecture, the search load per cluster node decreases as the number of nodes increases. Is my assumption correct? Does anybody have experience with similar performance issues? The goal is to get search times under 3 seconds for short-range searches and under 10 seconds for long-range searches.
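To illustrate the assumption I am making: if Elasticsearch balances shards evenly across data nodes (which it does by default, barring allocation constraints), the number of shards each node has to serve during a full-cluster search drops as nodes are added. A minimal sketch with the shard count from our cluster:

```python
# Sketch of how per-node search fan-out shrinks as data nodes are added,
# assuming perfectly even shard allocation (an idealized assumption).
def shards_per_node(total_shards: int, data_nodes: int) -> float:
    """Average number of shards each data node must serve."""
    return total_shards / data_nodes

for nodes in (2, 3, 4):
    print(f"{nodes} data nodes -> ~{shards_per_node(450, nodes):.0f} shards/node")
```

This is only the idealized picture; actual per-node load also depends on which node holds the primaries, replica counts, and disk speed.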
Thank you very much in advance for any suggestions on this.
Best regards, Stefan