Elasticsearch optimization

Hi,

My Graylog buffers and disk journal are filling up. It starts with the output buffer, then the process buffer and the disk journal. The input buffer is always empty. When the load drops, Graylog is able to clear these buffers, but then it happens again.
I’m guessing that Elasticsearch is the bottleneck here and I wonder how I can improve its performance.

My setup:
3 Graylog nodes
Graylog 4.3.8 + Mongo

3 Elasticsearch nodes
Elasticsearch 6.8.23

All have the same specification:
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7452 32-Core Processor
Stepping: 0
CPU MHz: 2345.602
BogoMIPS: 4691.20
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-7

Memory: 64GB

Disk: 3.0TB HDD

Elasticsearch:
Status is always green.
Active shards 900
Active primary shards 450
Avg CPU usage 40%
Avg memory usage 59GB
Avg disk usage 77%
Avg documents indexing rate 9k
Avg indexing latency 540µs
Heap size 32GB
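For reference, the same figures can be pulled from the cluster APIs, for example with a node reachable on localhost:9200:

# overall cluster status and shard counts
curl -s 'http://localhost:9200/_cluster/health?pretty'
# per-node heap, CPU and disk usage
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,ram.percent,cpu,disk.used_percent'
# index sizes, largest first
curl -s 'http://localhost:9200/_cat/indices/graylog_*?v&h=index,pri,rep,docs.count,store.size&s=store.size:desc'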

Graylog:
rotation_strategy = count
retention_strategy = delete
elasticsearch_index_prefix = graylog
processor_wait_strategy = blocking
ring_size = 262144
message_journal_enabled = true
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 200
elasticsearch_shards = 2
elasticsearch_replicas = 1
processbuffer_processors = 17
output_batch_size = 17000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
outputbuffer_processors = 13
inputbuffer_ring_size = 262144
inputbuffer_processors = 1
inputbuffer_wait_strategy = blocking
message_journal_max_size = 25gb

Heap size = -Xms16g -Xmx32g
200 indices with a total of 4,056,075,520 messages under management
Incoming message rate 15-30k/s
Avg CPU usage 60%
Avg memory usage 50%

I also want to increase the disk size of the Elasticsearch nodes to get a longer retention time, but I need to deal with this problem first.
Is there anything I can do to increase performance, or should I perhaps add nodes to the Elasticsearch cluster?

Thank you in advance!

Hi @danpop
welcome to the community. You have quite a big instance already, congratulations! I have a few thoughts:

  1. You have three Elasticsearch nodes with 64GB RAM each and a heap size of 32GB. Once the heap reaches 32GB the JVM loses compressed object pointers, so Java’s RAM usage grows a lot. Better to go for twice the number of machines and halve everything.
  2. Your Elasticsearch is a little old. If you upgrade, make sure to plan your path to OpenSearch, as Graylog will use that in the long run.
  3. I once gave a very rough intro on how to tune Elasticsearch. Have you taken care of the sizes of your shards? (See the shard listing sketched after this list.)
  4. output_batch_size = 17000 is huge. I think I use 500 to 1000 at most.
  5. You have 1 inputbuffer_processor, 17 processbuffer_processors and 13 outputbuffer_processors. That is 31 in total on machines with 8 CPUs. I understand if you don’t want to overprovision your CPU, but going about 10-15% over the core count has been fine for me so far (a server.conf sketch follows after this list).
  6. Are all of your Elasticsearch nodes listed in your Graylog config (elasticsearch_hosts)? If your Graylog “talks” to only one of them, that node might be the bottleneck.
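To make points 3 to 5 a bit more concrete, two rough sketches. The numbers are illustrative only, and they assume a node reachable on localhost:9200 with the graylog index prefix from your config.

For the shard sizes (point 3): roughly 3 nodes x 3TB at 77% full is about 7TB of data; spread over 900 shard copies that is only 7-8GB per shard, while the usual guidance is somewhere in the 10-50GB range. You can list your shards sorted by size with:

# largest shards first; each line is one primary or replica copy
curl -s 'http://localhost:9200/_cat/shards/graylog_*?v&h=index,shard,prirep,store,node&s=store:desc' | head -20

For the processors and batch size (points 4 and 5), a hypothetical server.conf starting point for an 8-CPU machine, to be tuned against your own measurements:

# 1 + 5 + 3 = 9 processors, roughly 10-15% over the 8 available CPUs
inputbuffer_processors = 1
processbuffer_processors = 5
outputbuffer_processors = 3
# batches in the low thousands instead of 17000
output_batch_size = 1000
output_flush_interval = 1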

Thank you @ihe !

I think I’ll move to a new cluster with Graylog 5 and OpenSearch, with a higher number of nodes but fewer resources per node.
Regarding your recommendation about shards: I had too many shards that were too small, so I will adjust that (a rough sketch of what I have in mind follows below).
I played around with the Graylog config (batch size, processors, etc.), but that didn’t change anything.
I have all Elasticsearch nodes in the Graylog configuration, so that’s fine.
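As a first, purely illustrative direction for the new index sets (the real values will depend on how big a shard the new cluster handles comfortably, and in current Graylog versions this is configured per index set under System/Indices rather than in server.conf):

elasticsearch_shards = 2
elasticsearch_replicas = 1
# let each index grow larger before rotation, so the shards get bigger
elasticsearch_max_docs_per_index = 40000000
# halve the index count to keep roughly the same total retention
elasticsearch_max_number_of_indices = 100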

Thank you for your hints! I think I know which direction to go in now :slight_smile:

