How to make search queries run faster?

Srijoni · January 23, 2019, 3:05pm

Hi Team,
I need help in how to reduce the time required for making my search queries run faster, and reduce the required run time.
Context: I have over 5 streams and the amount of time taken for each stream is quite some. Over 200ms. So I need the query time to be reduced. Can anyone help?

Totally_Not_A_Robot · January 24, 2019, 7:30am

I could be wrong, but streams and queries are not directly linked are they?

If I understand correctly, streams shift and move incoming data into various locations based on a ruleset.
Queries will dig through all available indices, based on a specified time range, looking for relevant data.

@jan am I missing something?

jan · January 24, 2019, 8:30am

nope @Totally_Not_A_Robot

when queries on streams take long - add more ressources to elasticsearch …

benvanstaveren · January 24, 2019, 8:45am

What does your ES cluster look like? Generally speaking, slow query response means your ES cluster is at capacity and will need to have additional data nodes added to it. 200ms isn’t actually bad, mind you…

To illustrate, our setup currently runs with 19 data nodes, 3 masters, 3 routing nodes, 6.1bn documents @ 10.2Tb, in 252 indices. Queries for a random bit of text over the last 7 days will usually return in 500-700ms, if I specify a particular term (and value), it’ll be 150-200ms.

Search speed also depends, partially, on replication. We have 3 replicas on all our indices (which means there’s always 4 different nodes that contain the data) so searches are spread across these 4 nodes and then results are combined. However, too many replicas will cause the overhead of the federated search and the combine phase to get too high, so it’s one of those “you’ll have to figure out what the right settings are for your use case” things.

(Additional edit: the size of our ES cluster is rather enormous, but it’s a requirement for us to keep logs around in a searchable fashion for a long time, and of course high availability for everything, all the time, even if half the world burns)

Totally_Not_A_Robot · January 24, 2019, 9:22am

I don’t understand that sentence though. How do you query a stream?? You query against indices, right? Or are we mixing two discussions and are you referring to queries that are part of a stream configuration?

I’ll say Holy heck dude, that’s some setup you have there!

macko003 · January 24, 2019, 9:46am

OFF

@benvanstaveren
I have some questions about your cluster
How did you start with this cluster? Do you planned it, or just increase when it needed?
Do you have performance experiences before and after you added the master and route nodes?
Do you use the 3 replicas only for the fast search?
How much resources did you configure for the nodes (cpu, mem, heap, disk) (data, master, route)
Why do you use so much data nodes?
// I read your old “heavy use” post, but a lot of things mismatch

benvanstaveren · January 24, 2019, 10:15am

To answer the questions:

1: We had an ELK cluster (6.x) consisting of 3 masters and 9 data nodes, that was the initial sizing for the cluster. We then decided to move to Graylog, and at the time we did that Graylog didn’t support ES 6.x so we set up 9 data nodes with 5.6.x to mirror our existing setup, then when we felt Graylog hit all the right spots, and with 2.5 coming out that had support for ES 6.x we just decided to merge the clusters (or rather, decommission the old one and just repurpose the hardware). No real reason for it except that we expect to be needing the capacity this year, so it felt easier to just have it ready now than to have to scramble to add it later. And yes, that makes 18 - we had a spare floating around that we hooked up for the lulz.

2: Performance experiences: the 3 masters were always there, the router nodes took up a lot of the load for queries since before Graylog connected to one data node and sniffed out the rest, which meant you end up also forcing a data node to do the federated search, on top of it’s regular index/query work, so that improved performance and stability a little.

3: 3 replicas are not just for the fast search (it’s actually more of a fun side effect), we just need it for data resilience. With 3 replicas we can lose a third of the cluster and still be at 100% - the logs we keep drive a heck of a lot of business critical things, so they need to pretty much be available, no matter what happens.

4: We run the entire show on bare metal, the master nodes have 32Gb memory, on i7 quad core CPU. ES uses 16Gb heap. Router nodes are also i7 quad cores with 64Gb memory, and ES set to use 32Gb heap, data nodes have 64Gb memory, ES using 32Gb heap, on intel xeon CPU’s.

Disk wise, the router and master nodes run 256Gb SSD in RAID 1, nothing special there, since they don’t really use much disk. Data nodes run 2x4Tb enterprise grade SATA drives in RAID 0 with XFS filesystem for storage - this is also where the replicas come in, if we lose a disk in a node it generally isn’t the end of the world, unless we lose 4 nodes at just the right time that contain the 4 replicas for a given index (statistically speaking you’re probably going to win the lottery before that happens)

5: Why so much data nodes? Because storage It turned out it was cheaper (significantly enough) to get more nodes with less storage capacity than a few nodes with huge storage capacity. We run all this on dedicated servers (unfortunately upper management still doesn’t want to co-locate, and I’m personally not a fan of running things like this on AWS/GCE) so it also helps that the data nodes are a ‘standard’ offering that we can order automatically - this means if a data node fails catastrophically, the cluster will recover itself quick (due to replicas), and I can spin up a replacement inside of 5 minutes while the broken one gets wiped and cancelled.

The heavy use post was the initial 9 node setup, I wanted to edit/amend it but the topic is archived so can’t seem to do that

macko003 · January 24, 2019, 10:46am

It elucidate a lot of things, we run our servers on virtualized hardware, and I get “disk” from dedicated storage. At the last 2 years I loose a node for few hours when we had a problem in a DC.
So I store only 1 replica, and do backup (also for archiving)
Thanks for the tips, I will do some research in this topic, and maybe increase the number of nodes. We have to a geo redundant cluster, and now the half cluster can’t handle the workday’s load.

//You can ask an admin to reopen the topic, I did it already.

benvanstaveren · January 24, 2019, 10:55am

I’ll add a disclaimer that this setup is probably pretty particular to our needs, so it may not be the best way to do it. But it works for us

We basically don’t back up our ES data nodes, on account of there being 3 replicas, so chances of 4 nodes all going down at just the right time to lose an index or two is astronomically low. Granted, I probably just jinxed it, but eh.

For archival we use an external tool that uses Graylog’s API to query index sets and ranges, and selects indices that are no longer “required” (e.g. only contains data > 90 days old), then the selected indices are all thrown into a single S3 snapshot, once that’s done the tool calls the Graylog API to delete the indices.

It’s currently in one of those “so seriously hacked together in a few hours” states but if there’s any interest from anyone I’ll throw the thing on github after cleaning it up

Okay so that didn’t have anything to do with queries and speed but… to re-cap and tl;dr it a little:

1: More replicas increases search speed, up to a certain point.
2: More nodes with less disk is (in my humble opinion) better than less nodes with more disk.

For cloud provisioning, I think also on the long term more instances with less dedicated storage is cheaper, but that may or may not be an issue

jan · January 24, 2019, 11:21am

I don’t understand that sentence though. How do you query a stream?? You query against indices, right? Or are we mixing two discussions and are you referring to queries that are part of a stream configuration?

I guess I was to short in this sentence.

A query in a stream was refered to a search in the UI when selected a stream. Like a user with not full access would use.

@macko003 @benvanstaveren what posting should be re-opened?

macko003 · January 24, 2019, 11:25am

if @benvanstaveren wish Users feedbacks / Guides for heavy load graylog Cluster

Totally_Not_A_Robot · January 24, 2019, 12:35pm

Heh, see also the MongoU clustering and performance management courses. This is something they touch on as well: make the calculations between adding extra nodes or extra storage when you need to keep more data.

Something-something-split-data-center-strategy

macko003 · January 24, 2019, 12:54pm

just one more thing, what ES nodes addresses do you use in GL config?
Master nodes, route nodes, all?

jan · January 24, 2019, 3:34pm

opened the issue again

paging: @benvanstaveren

benvanstaveren · January 24, 2019, 4:05pm

Just the routing nodes, with sniffing disabled.

benvanstaveren · January 24, 2019, 4:06pm

Muy gracias, I’ll make a post

benvanstaveren · January 24, 2019, 4:13pm

I kind of grew into crunching the numbers back in the day where you could save a couple thousand euros by doing it - so I kept doing it. We’ve done it for other services too, our Nomad cluster that runs most of our infrastructure stuff is also made up of smaller servers, for price reason as well as redundancy (of sorts).

system · February 7, 2019, 4:17pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Help in Better performance of my Elasticsearch-Graylog Setup Graylog Central (peer support)	11	4255	March 13, 2019
Enhance Graylog search performance by adding new Elastic nodes? Graylog Central (peer support)	3	1670	August 24, 2017
Terrible search performance Graylog Central (peer support)	12	2011	June 10, 2019
High load on the elasticsearch data nodes Graylog Central (peer support)	8	2966	September 27, 2019
How many nodes needed for >5TB logs in a day? Graylog Central (peer support) pipeline-rules , route-to-streampl	29	4707	March 8, 2019

How to make search queries run faster?

Related topics