Hi Team,
I need help making my search queries run faster.
Context: I have over 5 streams, and searches on each stream take quite a while — over 200ms each. So I need the query time to be reduced. Can anyone help?
I could be wrong, but streams and queries are not directly linked, are they?
- If I understand correctly, streams route incoming messages into various locations based on a ruleset.
- Queries dig through all available indices, based on a specified time range, looking for relevant data.
@jan am I missing something?
nope @Totally_Not_A_Robot
When queries on streams take long: add more resources to Elasticsearch …
What does your ES cluster look like? Generally speaking, slow query response means your ES cluster is at capacity and will need to have additional data nodes added to it. 200ms isn't actually bad, mind you…
To illustrate, our setup currently runs with 19 data nodes, 3 masters, 3 routing nodes, 6.1bn documents @ 10.2TB, in 252 indices. Queries for a random bit of text over the last 7 days will usually return in 500-700ms; if I specify a particular term (and value), it'll be 150-200ms.
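The two timings above correspond to two different query shapes. As a sketch, here is roughly what those two request bodies look like in Elasticsearch's query DSL, built as Python dicts (the `timestamp` field name and the functions themselves are illustrative, not from the post):

```python
def free_text_query(text):
    """Free-text search over a 7-day window (the slower ~500-700ms case)."""
    return {
        "query": {
            "bool": {
                "must": {"query_string": {"query": text}},
                "filter": {"range": {"timestamp": {"gte": "now-7d"}}},
            }
        }
    }

def term_query(field, value):
    """Exact match on one field (the faster ~150-200ms case)."""
    return {
        "query": {
            "bool": {
                "must": {"term": {field: value}},
                "filter": {"range": {"timestamp": {"gte": "now-7d"}}},
            }
        }
    }
```

Either body would be POSTed to `/<index>/_search`; the response's `took` field reports the server-side query time in milliseconds.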
Search speed also depends, partially, on replication. We have 3 replicas on all our indices (which means there are always 4 different nodes that contain the data), so searches are spread across these 4 nodes and then the results are combined. However, too many replicas will cause the overhead of the federated search and the combine phase to get too high, so it's one of those "you'll have to figure out what the right settings are for your use case" things.
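For reference, the replica count is a per-index setting you can change live via the `_settings` API. A minimal sketch of the payload and the resulting copy count (the helper names are mine, not from the post):

```python
import json

def replica_settings(replicas):
    """Body for PUT /<index>/_settings to change the replica count."""
    return json.dumps({"index": {"number_of_replicas": replicas}})

def data_copies(replicas):
    """Total copies of each shard: the primary plus its replicas.
    With 3 replicas, 4 nodes hold every shard, so a search can be
    served by (and spread across) any of those 4."""
    return 1 + replicas
```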
(Additional edit: the size of our ES cluster is rather enormous, but it's a requirement for us to keep logs around in a searchable fashion for a long time, and of course high availability for everything, all the time, even if half the world burns.)
I don't understand that sentence though. How do you query a stream? You query against indices, right? Or are we mixing two discussions, and are you referring to queries that are part of a stream configuration?
I'll say: Holy heck, dude, that's some setup you have there!
OFF
@benvanstaveren
I have some questions about your cluster
How did you start with this cluster? Did you plan it out, or just grow it as needed?
Do you have performance comparisons from before and after you added the master and routing nodes?
Do you use the 3 replicas only for fast searches?
How much resource did you configure for each node type (CPU, memory, heap, disk) (data, master, route)?
Why do you use so many data nodes?
// I read your old "heavy use" post, but a lot of things don't match up
To answer the questions:
1: We had an ELK cluster (6.x) consisting of 3 masters and 9 data nodes; that was the initial sizing for the cluster. We then decided to move to Graylog, and at the time Graylog didn't support ES 6.x, so we set up 9 data nodes with 5.6.x to mirror our existing setup. Then, when we felt Graylog hit all the right spots, and with 2.5 coming out with support for ES 6.x, we just decided to merge the clusters (or rather, decommission the old one and repurpose the hardware). No real reason for it except that we expect to need the capacity this year, so it felt easier to have it ready now than to have to scramble to add it later. And yes, that makes 18; we had a spare floating around that we hooked up for the lulz.
2: Performance experiences: the 3 masters were always there. The router nodes took a lot of the query load, since before that Graylog connected to one data node and sniffed out the rest, which meant you also forced a data node to do the federated search on top of its regular index/query work. So that improved performance and stability a little.
3: The 3 replicas are not just for fast search (that's actually more of a fun side effect); we need them for data resilience. With 3 replicas we can lose a third of the cluster and still be at 100%. The logs we keep drive a heck of a lot of business-critical things, so they need to pretty much be available no matter what happens.
4: We run the entire show on bare metal. The master nodes have 32GB memory on quad-core i7 CPUs, with ES using a 16GB heap. Router nodes are also quad-core i7s with 64GB memory and ES set to a 32GB heap; data nodes have 64GB memory, ES using a 32GB heap, on Intel Xeon CPUs.
Disk-wise, the router and master nodes run 256GB SSDs in RAID 1; nothing special there, since they don't really use much disk. Data nodes run 2x 4TB enterprise-grade SATA drives in RAID 0 with an XFS filesystem for storage. This is also where the replicas come in: if we lose a disk in a node it generally isn't the end of the world, unless we lose, at just the right time, the 4 nodes that contain the 4 copies of a given index (statistically speaking, you're probably going to win the lottery before that happens).
5: Why so much data nodes? Because storage It turned out it was cheaper (significantly enough) to get more nodes with less storage capacity than a few nodes with huge storage capacity. We run all this on dedicated servers (unfortunately upper management still doesnāt want to co-locate, and Iām personally not a fan of running things like this on AWS/GCE) so it also helps that the data nodes are a āstandardā offering that we can order automatically - this means if a data node fails catastrophically, the cluster will recover itself quick (due to replicas), and I can spin up a replacement inside of 5 minutes while the broken one gets wiped and cancelled.
The heavy use post was about the initial 9-node setup; I wanted to edit/amend it, but the topic is archived so I can't seem to do that.
That clears up a lot of things. We run our servers on virtualized hardware, and I get "disk" from dedicated storage. In the last 2 years I've only lost a node for a few hours, when we had a problem in a DC.
So I store only 1 replica, and do backups (also for archiving).
Thanks for the tips. I will do some research on this topic, and maybe increase the number of nodes. We have a geo-redundant cluster, and right now half the cluster can't handle a workday's load.
// You can ask an admin to reopen the topic; I've done that before.
Iāll add a disclaimer that this setup is probably pretty particular to our needs, so it may not be the best way to do it. But it works for us
We basically don't back up our ES data nodes, on account of there being 3 replicas, so the chance of 4 nodes all going down at just the right time to lose an index or two is astronomically low. Granted, I probably just jinxed it, but eh.
For archival we use an external tool that uses Graylog's API to query index sets and ranges and selects indices that are no longer "required" (e.g. only containing data > 90 days old). The selected indices are all thrown into a single S3 snapshot; once that's done, the tool calls the Graylog API to delete the indices.
It's currently in one of those "seriously hacked together in a few hours" states, but if there's any interest from anyone I'll throw the thing on GitHub after cleaning it up.
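The index-selection step of a tool like that can be sketched in a few lines. This is my own minimal take, not the actual tool: given a mapping of index name to the newest timestamp it contains (as reported by Graylog's index range API), pick the indices whose data is entirely older than the retention cutoff. The Graylog API calls, the ES `PUT _snapshot/<repo>/<name>` request, and the delete calls are omitted:

```python
from datetime import datetime, timedelta, timezone

def archivable(index_ranges, now, retention_days=90):
    """Return indices whose newest document is older than the cutoff,
    i.e. indices that only contain data > retention_days old."""
    cutoff = now - timedelta(days=retention_days)
    return sorted(name for name, newest in index_ranges.items()
                  if newest < cutoff)
```

The resulting list would then go into a single snapshot body, e.g. `{"indices": "graylog_10,graylog_11"}`, before the deletes are issued.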
Okay, so that didn't have anything to do with queries and speed, but… to recap and tl;dr it a little:
1: More replicas increase search speed, up to a certain point.
2: More nodes with less disk is (in my humble opinion) better than fewer nodes with more disk.
For cloud provisioning, I think more instances with less dedicated storage is also cheaper in the long term, but that may or may not be an issue.
I don't understand that sentence though. How do you query a stream? You query against indices, right? Or are we mixing two discussions, and are you referring to queries that are part of a stream configuration?
I guess I was too brief in that sentence.
By "a query in a stream" I meant a search in the UI with a stream selected, like a user without full access would run.
@macko003 @benvanstaveren which post should be re-opened?
Heh, see also the MongoU clustering and performance management courses. This is something they touch on as well: make the calculations between adding extra nodes or extra storage when you need to keep more data.
Something-something-split-data-center-strategy
Just one more thing: which ES node addresses do you use in the GL config?
Master nodes, routing nodes, all of them?
Just the routing nodes, with sniffing disabled.
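In graylog.conf that setup would look something like the fragment below (hostnames are placeholders; the exact option names may vary by Graylog version, so check the docs for yours):

```
# graylog.conf - point Graylog only at the routing nodes
elasticsearch_hosts = http://es-router-1:9200,http://es-router-2:9200,http://es-router-3:9200

# leave node discovery ("sniffing") off so Graylog doesn't bypass the routers
elasticsearch_discovery_enabled = false
```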
Many thanks, I'll make a post.
I kind of grew into crunching the numbers back in the day, when you could save a couple thousand euros by doing it, so I kept doing it. We've done it for other services too: our Nomad cluster, which runs most of our infrastructure stuff, is also made up of smaller servers, for price reasons as well as redundancy (of sorts).
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.