I currently run a cluster of 5 Graylog nodes on AWS c5 EC2 instances (16 CPUs and 32 GB RAM each).
On these machines I also run Elasticsearch coordinating-only nodes.
Heap size for Graylog: 12 GB
Heap size for the ES coordinating-only node: 8 GB
http.max_content_length in Elasticsearch: 1024mb
Index refresh interval: 15s
The Graylog nodes are configured to send messages to all 5 coordinating-only nodes.
The coordinating-only nodes are part of an Elasticsearch cluster consisting of 16 data nodes and 3 dedicated master nodes.
Each data node has a 3.5 TB NVMe SSD, 16 cores, and 122 GB RAM.
Output batch size: 1000
Refresh rate: 1s
Max Elasticsearch connections: 160
Max connections per route: 32
Our median log flow is 15,000 msg/sec.
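For reference, here is roughly how that maps onto config files. The setting names below are the usual Graylog server.conf and elasticsearch.yml keys, the host names are placeholders, and mapping my "refresh rate: 1s" to output_flush_interval is my own reading; treat this as a sketch of my setup, not a template.

```
# graylog server.conf (Graylog-to-Elasticsearch output settings)
# es-coord-1..5 are placeholders for the 5 coordinating-only nodes
elasticsearch_hosts = http://es-coord-1:9200,http://es-coord-2:9200,http://es-coord-3:9200,http://es-coord-4:9200,http://es-coord-5:9200
output_batch_size = 1000
output_flush_interval = 1
elasticsearch_max_total_connections = 160
elasticsearch_max_total_connections_per_route = 32
```

```
# elasticsearch.yml on the coordinating-only nodes
node.master: false
node.data: false
node.ingest: false
http.max_content_length: 1024mb
# index.refresh_interval (15s in our case) is a per-index setting,
# applied through the index template rather than elasticsearch.yml
```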
Does it make sense to raise the batch size to 10,000, or could that have a negative effect on performance due to the very large bulk size?
A large bulk size may also cause ES to reject it. On our cluster I found that a batch size of about 2048 with more outputbuffer_processors can raise performance, up to a certain extent. We run 3 Graylog nodes with 8 outputbuffer_processors and 16 processbuffer_processors (on 24-core machines).
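A minimal sketch of what that looks like in server.conf (these are our values, not a recommendation for every setup):

```
# server.conf on each of our 3 Graylog nodes (24 cores each)
output_batch_size = 2048
outputbuffer_processors = 8
processbuffer_processors = 16
# keep the total number of buffer processors below the core count so the
# JVM, the inputs and the OS still have headroom
```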
Realistically, since everyone's setup is different, the only advice I can give you is: experiment. Give it a shot and see what happens. There's currently no real "silver bullet" for larger setups.
You're right. I was just wondering about the rule of thumb "set the batch size to your median log rate", but it looks like it is not going to work for extremely heavily loaded setups.
I think raising the number of connections per route will also help. We use 64 per route with Graylog pointed at 3 coordinating nodes, for a maximum of 3 * 64 connections (because, well, math and random reasons).
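In server.conf terms that is roughly the following (host names are placeholders; the total is simply per-route times the number of target nodes):

```
# Graylog pointed at 3 coordinating-only nodes
elasticsearch_hosts = http://es-coord-1:9200,http://es-coord-2:9200,http://es-coord-3:9200
elasticsearch_max_total_connections_per_route = 64
elasticsearch_max_total_connections = 192   # 3 nodes * 64 per route
```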
Actually, I am not complaining about performance: since I set http.max_content_length and the index refresh interval properly, everything works fantastically with an output batch size of 1000. But since our log traffic keeps growing, I want to be ready for higher throughput.