Graylog very slow

Graylog searches have become extremely slow. A 5-minute search of our firewall traffic, which used to take at most 10 seconds, is now taking upwards of a minute to complete.

Currently we are only running a single Graylog server, which houses both Graylog and Elasticsearch. We are averaging around 800 logs a second, with spikes up to 1,500.

The system is currently Ubuntu 20.04 with 32 cores and 128 GB of RAM. 30 GB are allocated to the Elasticsearch heap and another 30 GB to Graylog; the rest is left for the system. Average CPU is hovering around 30 percent and average memory usage around 50 percent. Disk I/O fluctuates more, but the peak was a little over 40 MiB/s.

Version: 5.0.5+d61a926, codename Noir
JVM: PID 2570, Eclipse Adoptium 17.0.6 on Linux 5.4.0-144-generic
Elasticsearch version: 7.10.2
1680183661 13:41:01 graylog green 1 1 999 999 0 0 0 0 - 100.0%
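
For anyone following along: the health line above looks like output from the Elasticsearch _cat/health API. A minimal sketch of how to pull the same numbers, plus per-node heap usage, assuming Elasticsearch is listening locally on port 9200 without authentication:

# Cluster status, node counts, and active/primary shard totals (same columns as the line above)
curl -s 'http://localhost:9200/_cat/health?v'

# Per-node heap usage, to see how much of the 30 GB Elasticsearch heap is actually in use
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.current,heap.percent,heap.max,ram.percent'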

In attempting to resolve the issue, we have tried restarting the service, rebooting, and applying updates. We have also tried modifying the server.conf file, changing the output batch size and the number of processors.

output_batch_size = 3000

# Flush interval (in seconds) for the Elasticsearch output. This is the maximum amount of time between two
# batches of messages written to Elasticsearch. It is only effective at all if your minimum number of messages
# for this time period is less than output_batch_size * outputbuffer_processors.
output_flush_interval = 1

# As stream outputs are loaded only on demand, an output which is failing to initialize will be tried over and
# over again. To prevent this, the following configuration options define after how many faults an output will
# not be tried again for an also configurable amount of seconds.
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

# The number of parallel running processors.
# Raise this number if your buffers are filling up.
processbuffer_processors = 8
outputbuffer_processors = 5
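
For context, a quick back-of-the-envelope check using the numbers above suggests the output path itself has plenty of headroom (a sketch, assuming the ingest rates quoted at the top of the post):

# Observed ingest: ~800 msg/s average, ~1,500 msg/s peak
# Capacity per flush cycle: output_batch_size (3000) x outputbuffer_processors (5) = 15,000 messages
# With output_flush_interval = 1 second, the output path can move on the order of 15,000 msg/s,
# roughly 10x the observed peak, so these output settings are unlikely to be the bottleneck.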

Any guidance as to where to go from here would be greatly appreciated.

Hello @Chase, what do your index set setups look like and how many shards total are currently being stored?

I would suggest lowering the total heap dedicated to Graylog to something closer to 6 GB; it doesn’t need much, and at some point too much becomes a detriment.
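
For anyone looking for where to make that change: a minimal sketch, assuming the standard .deb package layout (the 6g value just mirrors the suggestion above; keep the other JVM flags already in the file):

# Graylog heap is set in /etc/default/graylog-server
GRAYLOG_SERVER_JAVA_OPTS="-Xms6g -Xmx6g <existing flags unchanged>"

# Elasticsearch heap is set the same way (-Xms/-Xmx) in /etc/elasticsearch/jvm.options
# or in a file under /etc/elasticsearch/jvm.options.d/

# Apply the Graylog change by restarting the service
sudo systemctl restart graylog-server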

Thank you for your help.

I’ll try the heap change and update to let the community know of any changes.

Total shards are 972.

Thanks,

Chase

@Chase

Based on your description, you are ingesting about 5-7 GB per day. You should not need anywhere near 30 GB of heap for Graylog. However, with 972 ES shards, you actually don’t have enough heap for Elasticsearch. You should consider trimming your retained indices, reducing shards, or adding another ES/OS node.

Although the guidance below applies only to Elasticsearch versions up to 7.10.x, the general principle applies to both Elasticsearch and Opensearch.

The most important bits are here:

TIP: Small shards result in small segments, which increases overhead. Aim to keep the average shard size between at least a few GB and a few tens of GB. For use-cases with time-based data, it is common to see shards between 20GB and 40GB in size.

TIP: As the overhead per shard depends on the segment count and size, forcing smaller segments to merge into larger ones through a forcemerge operation can reduce overhead and improve query performance. This should ideally be done once no more data is written to the index. Be aware that this is an expensive operation that should ideally be performed during off-peak hours.

TIP: The number of shards you can hold on a node will be proportional to the amount of heap you have available, but there is no fixed limit enforced by Elasticsearch. A good rule-of-thumb is to ensure you keep the number of shards per node below 20 per GB heap it has configured. A node with a 30GB heap should therefore have a maximum of 600 shards, but the further below this limit you can keep it the better. This will generally help the cluster stay in good health.

So, you first need to reduce the heap for both Graylog and Opensearch, since both JVMs draw from the same system RAM. The rule of thumb is that total Java heap size should be half of system RAM, with a maximum of 31 GB. Together, GL and Opensearch cannot exceed 31 GB.

The good news is that Graylog probably doesn’t need more than 1-2 GB of heap to handle your current load. Unless you are doing a lot of processing on the messages, or a ton of queries or alerts, it’s pretty efficient. The rest can go to Opensearch. You will still need to manage shards, indices, or both, unless you want to add another node, but that should get you back on track.
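
To put the shards-per-heap rule in concrete terms for this box, and to show the checks and the force-merge call referenced in the tips above (a sketch; the index name is just a placeholder, and the same local, unauthenticated node is assumed):

# Rule of thumb: <= 20 shards per GB of heap
#   30 GB ES heap -> ~600 shards max; 972 shards is well over that
#   972 shards / 20 -> ~49 GB of heap would be needed at the current shard count

# List shards with their on-disk size, largest first, to spot undersized shards
curl -s 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,store&s=store:desc'

# Optional: merge segments on an index that is no longer being written to
# (expensive; run off-peak, and only on rotated indices)
curl -s -X POST 'http://localhost:9200/example_index_0/_forcemerge?max_num_segments=1'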

We are actually getting closer to 100-110 GB a day. We have firewall logs, Windows Event Logs, as well as flat files all being logged.

What would be your recommendations? (FYI, based on a previous post, I have reduced the Graylog heap to 7 GB.)

Thanks,

Chase

Chris’ YouTube video on architecture recommendations would probably be helpful, especially if you are ingesting that much: Graylog Labs: Graylog Reference Architecture - YouTube

Thank you for the video. It was very insightful. Unfortunately, I’m limited to the hardware I already have for my Graylog node.

With that said, one thing I have noticed is that this didn’t seem to be a problem until about two weeks ago. Before that, everything was running fine, with no issues doing any searches. I’m not aware of any changes we have made other than adding dashboards and saving queries.

From what I can tell, ingestion seems to be fine; it’s only when we are running queries that we see the slowness.

Thanks for your help

Have you been sitting at the same data retention in your indices the whole time? My first thought is that the amount of data in Opensearch has grown, so it is now taking more memory to manage all those shards, and that is causing the issues.

As Chris said, with that many shards you don’t have enough memory, and the number of shards is tied to the total volume of data stored. So I think there are a few options: add another Opensearch node to increase the total heap available to Opensearch, ingest less data to shrink the overall storage size, or store the data for a shorter time to reduce the overall storage size (which will reduce the number of shards, and therefore the RAM requirements).

The reason you aren’t seeing issues on write is that Graylog can cache the writes, so it can handle variations in performance; plus, writing a message is less taxing than searching through a boatload of messages.
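
If you want evidence that it is the search side under pressure rather than ingest, one place to look (a sketch, same local-node assumption as above) is the Elasticsearch search thread pool; queued or rejected searches show up here:

# Active, queued, and rejected tasks in the search thread pool; a growing queue
# or non-zero rejections points at searches being the pressure point
curl -s 'http://localhost:9200/_cat/thread_pool/search?v&h=node_name,name,active,queue,rejected,completed'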

So without virtualizing and/or buying new hardware, what would be your recommendation? Reduce the retention of our data? Is there a way to test something like this without losing the data?

Thanks again.

There isn’t really a way to test it that I can think of. What is the total volume of data you are storing, and how long are you currently keeping it for?
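
One way to answer that from the Elasticsearch side (a sketch, same local-node assumption) is to list indices by on-disk size:

# Index name, primary shard count, doc count, and store size, biggest first
curl -s 'http://localhost:9200/_cat/indices?v&h=index,pri,docs.count,store.size&s=store.size:desc'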

Presently the server has 14 TB, but is only using 3 TB. This data is spread across 12 index sets, with retention set to a maximum of 90 days and a 1-day rotation.

Thanks,

I’d also like to point out that this is affecting our short-term searches. For instance, if I do a 5-minute search, it’ll take 45 seconds to return the data. This didn’t use to be as much of a problem, with typical search results coming back within 5-10 seconds.

Thanks,

Currently, what is the average size of an index after rotation, and how many shards does each index have?

The largest index set I have is configured with 5 shards per rotation, and rotation is set to 1 day. On average, the largest index is about 22 GB a day.

Thanks.

So one option you could try would be to reduce the number of shards per index. At that scale, even 1 shard per index could work: you want to keep shards at roughly 20-50 GB, and if you are at 22 GB a day per index you are in that range. You can change the number of shards by editing the index set on the System > Indices page.

To be on the safe side, maybe start by going to 2 shards; that may be a good balance.

Now, that won’t give you an instant fix. You are trying to get that overall shard number down, so it will take a while (as long as your older indices are still being deleted after 90 days) to start to see improvements in performance. The only fast fix would be to delete some of that old data to get the shard count down quickly.
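
To put rough numbers on that (a sketch based on the figures earlier in the thread; it assumes only the largest index set uses 5 shards and that retention stays at 90 daily indices):

# Current largest index set: ~22 GB/day across 5 shards -> ~4.4 GB per shard (well under the 20-40 GB guidance)
# With 2 shards per daily index: ~22 GB / 2 -> ~11 GB per shard (healthier range)
# With 1 shard per daily index:  ~22 GB / 1 -> ~22 GB per shard (also fine)
#
# Shard count impact, once a full 90-day cycle of old 5-shard indices has aged out:
#   5 shards/day x 90 days = 450 shards  ->  2 shards/day x 90 days = 180 shards
#   i.e. roughly 270 fewer shards from this one index set alone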

I will give this a try.

Thanks.
