How to make search queries run faster?


(Srijoni Biswas) #1

Hi Team,
I need help in how to reduce the time required for making my search queries run faster, and reduce the required run time.
Context: I have over 5 streams and the amount of time taken for each stream is quite some. Over 200ms. So I need the query time to be reduced. Can anyone help?


(Tess) #2

I could be wrong, but streams and queries are not directly linked are they?

  • If I understand correctly, streams shift and move incoming data into various locations based on a ruleset.
  • Queries will dig through all available indices, based on a specified time range, looking for relevant data.

@jan am I missing something?


(Jan Doberstein) #3

nope @Totally_Not_A_Robot

when queries on streams take long - add more ressources to elasticsearch …


(Ben van Staveren) #4

What does your ES cluster look like? Generally speaking, slow query response means your ES cluster is at capacity and will need to have additional data nodes added to it. 200ms isn’t actually bad, mind you…

To illustrate, our setup currently runs with 19 data nodes, 3 masters, 3 routing nodes, 6.1bn documents @ 10.2Tb, in 252 indices. Queries for a random bit of text over the last 7 days will usually return in 500-700ms, if I specify a particular term (and value), it’ll be 150-200ms.

Search speed also depends, partially, on replication. We have 3 replicas on all our indices (which means there’s always 4 different nodes that contain the data) so searches are spread across these 4 nodes and then results are combined. However, too many replicas will cause the overhead of the federated search and the combine phase to get too high, so it’s one of those “you’ll have to figure out what the right settings are for your use case” things.

(Additional edit: the size of our ES cluster is rather enormous, but it’s a requirement for us to keep logs around in a searchable fashion for a long time, and of course high availability for everything, all the time, even if half the world burns)


(Tess) #5

I don’t understand that sentence though. How do you query a stream?? You query against indices, right? Or are we mixing two discussions and are you referring to queries that are part of a stream configuration?

I’ll say :smiley: Holy heck dude, that’s some setup you have there!


#6

OFF

@benvanstaveren
I have some questions about your cluster
How did you start with this cluster? Do you planned it, or just increase when it needed?
Do you have performance experiences before and after you added the master and route nodes?
Do you use the 3 replicas only for the fast search?
How much resources did you configure for the nodes (cpu, mem, heap, disk) (data, master, route)
Why do you use so much data nodes?
// I read your old “heavy use” post, but a lot of things mismatch


(Ben van Staveren) #7

To answer the questions:

1: We had an ELK cluster (6.x) consisting of 3 masters and 9 data nodes, that was the initial sizing for the cluster. We then decided to move to Graylog, and at the time we did that Graylog didn’t support ES 6.x so we set up 9 data nodes with 5.6.x to mirror our existing setup, then when we felt Graylog hit all the right spots, and with 2.5 coming out that had support for ES 6.x we just decided to merge the clusters (or rather, decommission the old one and just repurpose the hardware). No real reason for it except that we expect to be needing the capacity this year, so it felt easier to just have it ready now than to have to scramble to add it later. And yes, that makes 18 - we had a spare floating around that we hooked up for the lulz.

2: Performance experiences: the 3 masters were always there, the router nodes took up a lot of the load for queries since before Graylog connected to one data node and sniffed out the rest, which meant you end up also forcing a data node to do the federated search, on top of it’s regular index/query work, so that improved performance and stability a little.

3: 3 replicas are not just for the fast search (it’s actually more of a fun side effect), we just need it for data resilience. With 3 replicas we can lose a third of the cluster and still be at 100% - the logs we keep drive a heck of a lot of business critical things, so they need to pretty much be available, no matter what happens.

4: We run the entire show on bare metal, the master nodes have 32Gb memory, on i7 quad core CPU. ES uses 16Gb heap. Router nodes are also i7 quad cores with 64Gb memory, and ES set to use 32Gb heap, data nodes have 64Gb memory, ES using 32Gb heap, on intel xeon CPU’s.

Disk wise, the router and master nodes run 256Gb SSD in RAID 1, nothing special there, since they don’t really use much disk. Data nodes run 2x4Tb enterprise grade SATA drives in RAID 0 with XFS filesystem for storage - this is also where the replicas come in, if we lose a disk in a node it generally isn’t the end of the world, unless we lose 4 nodes at just the right time that contain the 4 replicas for a given index (statistically speaking you’re probably going to win the lottery before that happens)

5: Why so much data nodes? Because storage :slight_smile: It turned out it was cheaper (significantly enough) to get more nodes with less storage capacity than a few nodes with huge storage capacity. We run all this on dedicated servers (unfortunately upper management still doesn’t want to co-locate, and I’m personally not a fan of running things like this on AWS/GCE) so it also helps that the data nodes are a ‘standard’ offering that we can order automatically - this means if a data node fails catastrophically, the cluster will recover itself quick (due to replicas), and I can spin up a replacement inside of 5 minutes while the broken one gets wiped and cancelled.

The heavy use post was the initial 9 node setup, I wanted to edit/amend it but the topic is archived so can’t seem to do that :frowning:


#8

It elucidate a lot of things, we run our servers on virtualized hardware, and I get “disk” from dedicated storage. At the last 2 years I loose a node for few hours when we had a problem in a DC.
So I store only 1 replica, and do backup (also for archiving)
Thanks for the tips, I will do some research in this topic, and maybe increase the number of nodes. We have to a geo redundant cluster, and now the half cluster can’t handle the workday’s load. :frowning:

//You can ask an admin to reopen the topic, I did it already.


(Ben van Staveren) #9

I’ll add a disclaimer that this setup is probably pretty particular to our needs, so it may not be the best way to do it. But it works for us :slight_smile:

We basically don’t back up our ES data nodes, on account of there being 3 replicas, so chances of 4 nodes all going down at just the right time to lose an index or two is astronomically low. Granted, I probably just jinxed it, but eh.

For archival we use an external tool that uses Graylog’s API to query index sets and ranges, and selects indices that are no longer “required” (e.g. only contains data > 90 days old), then the selected indices are all thrown into a single S3 snapshot, once that’s done the tool calls the Graylog API to delete the indices.

It’s currently in one of those “so seriously hacked together in a few hours” states but if there’s any interest from anyone I’ll throw the thing on github after cleaning it up :slight_smile:

Okay so that didn’t have anything to do with queries and speed but… to re-cap and tl;dr it a little:

1: More replicas increases search speed, up to a certain point.
2: More nodes with less disk is (in my humble opinion) better than less nodes with more disk.

For cloud provisioning, I think also on the long term more instances with less dedicated storage is cheaper, but that may or may not be an issue :slight_smile:


(Jan Doberstein) #10

I don’t understand that sentence though. How do you query a stream?? You query against indices, right? Or are we mixing two discussions and are you referring to queries that are part of a stream configuration?

I guess I was to short in this sentence.

A query in a stream was refered to a search in the UI when selected a stream. Like a user with not full access would use.

@macko003 @benvanstaveren what posting should be re-opened?


#11

if @benvanstaveren wish Users feedbacks / Guides for heavy load graylog Cluster


(Tess) #12

Heh, see also the MongoU clustering and performance management courses. This is something they touch on as well: make the calculations between adding extra nodes or extra storage when you need to keep more data.

Something-something-split-data-center-strategy :slight_smile:


#13

just one more thing, what ES nodes addresses do you use in GL config?
Master nodes, route nodes, all?


(Jan Doberstein) #14

opened the issue again

paging: @benvanstaveren


(Ben van Staveren) #15

Just the routing nodes, with sniffing disabled.


(Ben van Staveren) #16

Muy gracias, I’ll make a post :slight_smile:


(Ben van Staveren) #17

I kind of grew into crunching the numbers back in the day where you could save a couple thousand euros by doing it - so I kept doing it. We’ve done it for other services too, our Nomad cluster that runs most of our infrastructure stuff is also made up of smaller servers, for price reason as well as redundancy (of sorts).


(system) closed #18

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.