Hi,
My Graylog buffers and disk journal are filling up. The output buffer fills first, followed by the process buffer and then the disk journal. The input buffer is always empty. When the load drops, Graylog is able to drain these buffers, but then it happens again.
I’m guessing that Elasticsearch is the bottleneck here and wonder how I can improve its performance.
My setup:
3 Graylog nodes
Graylog 4.3.8 + Mongo
3 Elasticsearch Nodes
Elasticsearch 6.8.23
All have the same specification:
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7452 32-Core Processor
Stepping: 0
CPU MHz: 2345.602
BogoMIPS: 4691.20
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-7
Memory: 64GB
Disk: 3.0TB HDD
Elasticsearch:
Status is always green.
Active shards 900
Active primary shards 450
Avg CPU usage 40%
Avg memory usage 59GB
Avg disk usage 77%
Avg documents indexing rate 9k
Avg indexing latency 540µs
Heap size 32GB
Graylog:
rotation_strategy = count
retention_strategy = delete
elasticsearch_index_prefix = graylog
processor_wait_strategy = blocking
ring_size = 262144
message_journal_enabled = true
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 200
elasticsearch_shards = 2
elasticsearch_replicas = 1
processbuffer_processors = 17
output_batch_size = 17000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
outputbuffer_processors = 13
inputbuffer_ring_size = 262144
inputbuffer_processors = 1
inputbuffer_wait_strategy = blocking
message_journal_max_size = 25gb
Heap size = -Xms16g -Xmx32g
200 indices with a total of 4,056,075,520 messages under management
Number of logs coming in 15-30k/s
Avg CPU usage 60%
Avg memory usage 50%
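For reference, here is a rough back-of-the-envelope check of the numbers above, sketched in Python. It only restates the posted settings as arithmetic: how many buffer processor threads are configured per CPU on one Graylog node, how many shards the index settings imply, and how long the configured index set lasts at the incoming rate. The function names are made up for illustration, and it assumes that the 15-30k/s figure is the cluster-wide total and that the rate is constant, which is a simplification.

```python
# Values copied from the configuration listed above.
CPUS_PER_NODE = 8
PROCESSBUFFER_PROCESSORS = 17
OUTPUTBUFFER_PROCESSORS = 13
INPUTBUFFER_PROCESSORS = 1

MAX_DOCS_PER_INDEX = 20_000_000
MAX_INDICES = 200
SHARDS_PER_INDEX = 2
REPLICAS = 1

SECONDS_PER_DAY = 86_400


def buffer_threads_per_cpu():
    """Return (total buffer processor threads, threads per CPU) for one node."""
    total = (PROCESSBUFFER_PROCESSORS
             + OUTPUTBUFFER_PROCESSORS
             + INPUTBUFFER_PROCESSORS)
    return total, total / CPUS_PER_NODE


def expected_shards():
    """Shards implied by the Graylog index settings (primaries plus replicas)."""
    return MAX_INDICES * SHARDS_PER_INDEX * (1 + REPLICAS)


def retention_days(messages_per_second):
    """Days of retention at a constant ingest rate (assumed cluster-wide)."""
    capacity = MAX_DOCS_PER_INDEX * MAX_INDICES
    return capacity / messages_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    total, ratio = buffer_threads_per_cpu()
    print(f"buffer processors per node: {total} ({ratio:.2f} per CPU)")
    print(f"shards implied by index settings: {expected_shards()}")
    for rate in (15_000, 30_000):
        print(f"retention at {rate}/s: {retention_days(rate):.2f} days")
```

With the posted values this comes to 31 buffer processors on 8 CPUs per node, 800 shards from the Graylog index set (the 900 active shards reported above presumably include other indices), and roughly 1.5 to 3 days of retention at 15-30k/s.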
I also want to increase the disk size of the Elasticsearch nodes to get a longer retention time, but I need to deal with this problem first.
Is there anything I can do to increase performance, or should I perhaps add nodes to the Elasticsearch cluster?
Thank you in advance!