Very high outputbuffer_processors count and small output_batch_size for better message throughput?

Hi there,

TLDR: We have some settings that work quite well, but we do not understand why.

We have been using Graylog for quite some time now in different environments and we really like it. In general we try to keep the setup simple and run all components on a single node. Our physical nodes have 20 cores (40 with HT), 256 GB RAM and spinning HDDs, which gives us okay-ish performance. Although we do not need that much RAM and CPU, this is our standard server hardware for larger workloads. We have log message peaks of around 60k-120k messages per second, which can be handled by using the disk journal. Loads of around 40k messages per second can be processed without using the journal. Although we thought the performance was somewhat weak, and we tried to follow the tuning advice for Graylog, like [1], and for Elasticsearch, like [2], we decided that this performance would do. We run Graylog versions 2.2 and 2.3; both perform similarly.

graylog/server.conf

output_batch_size = 4000
output_flush_interval = 1
processbuffer_processors = 16
outputbuffer_processors =  16
processor_wait_strategy = blocking  

elasticsearch.yml

# optimize for spinning disks, rather than SSD
index.merge.scheduler.max_thread_count: 1
index.translog.flush_threshold_size: 1gb
index.refresh_interval: 60s

# we don't really care about search performance. So let the indexer work as
# fast as possible without throttling
indices.store.throttle.type: none
index.merge.scheduler.max_merge_count: 16

Recently we deployed some instances to a cloud provider with 16 vCPUs, 56 GB RAM and SSDs. Obviously these machines could not match the performance we observed in our bare metal deployments, but we could not manage to get a throughput of more than 5k messages per second. Of course the setup was adjusted to account for less RAM and fewer CPUs, but we needed to increase the throughput and started to experiment with the settings. Increasing the output batch size impacted performance negatively, so we reduced the batch size to 100, well below the default value. At the same time we increased processbuffer_processors and outputbuffer_processors quite a lot. So the current settings, which seem to be a sweet spot, are really the opposite of the best practices.

graylog/server.conf

output_batch_size = 100
processbuffer_processors = 72
outputbuffer_processors =  128 
elasticsearch_max_total_connections = 256
elasticsearch_max_total_connections_per_route = 256  

elasticsearch.yml

# optimize for spinning disks, rather than SSD
# index.merge.scheduler.max_thread_count: 1
# index.translog.flush_threshold_size: 1gb
index.refresh_interval: 30s

# we don't really care about search performance. So let the indexer work as
# fast as possible without throttling
# indices.store.throttle.type: none
# index.merge.scheduler.max_merge_count: 16  

This gives us a throughput of 50-55k messages per second. Of course all those additional processor threads have an impact on CPU usage, but since we can increase the VM size easily, I think we could hit 100k messages per second if needed.

Out of curiosity we deployed the same settings to our bare metal machines, and they seem to work fantastically there, too. Even with spinning disks instead of SSDs the throughput is between 90k-100k messages per second.

So my question is: why are these settings performing so much better, and is it possible to increase the throughput even further before splitting Elasticsearch off to separate machines or setting up a Graylog cluster? Is anybody seeing similar effects?
Cheers,
Philip
[1] https://www.graylog.org/blog/74-back-to-basics-from-single-server-to-graylog-cluster
[2] https://www.elastic.co/guide/en/elasticsearch/reference/master/tune-for-indexing-speed.html

Additionally, performance plots from the bare metal deployment:

Your “tuning” was in the wrong direction.

First, processbuffer_processors, outputbuffer_processors and inputbuffer_processors should not add up to more than the available cores.
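
For example, on a 16 vCPU machine a split that respects that rule would look roughly like the following (the exact split is only an illustration, not a recommendation):

inputbuffer_processors = 2
processbuffer_processors = 10
outputbuffer_processors = 4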

Second, with the second configuration you have many, many small chunks of messages being pushed from Graylog into Elasticsearch by a large number of output buffer processors. That will create many small segments and force Elasticsearch to merge them on top of the indexing load. In addition, the bulk API connection limit of Elasticsearch might be a problem.
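
As an illustration of the opposite direction (example values only, not a tested recommendation), a configuration that sends fewer but larger bulk requests would look roughly like this:

# bigger batches and fewer writers -> fewer, larger bulk requests and less segment churn
output_batch_size = 2000
output_flush_interval = 1
outputbuffer_processors = 4
# keep the HTTP connection pool in line with the number of writers
elasticsearch_max_total_connections = 20
elasticsearch_max_total_connections_per_route = 20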

Letting Graylog and Elasticsearch fight for the available resources works well on bare metal (most of the time), but in cloud environments you will need to split them much sooner.
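
When you do split them, pointing Graylog at the dedicated Elasticsearch node(s) is a small change in server.conf. A minimal sketch, assuming the HTTP-based elasticsearch_hosts setting introduced with Graylog 2.3 and placeholder hostnames:

# Graylog talks to dedicated Elasticsearch nodes over HTTP
elasticsearch_hosts = http://es-node-1:9200,http://es-node-2:9200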

Thanks a lot Jan for your reply.

That’s what we thought, too; this was the reason for the initial configuration. When I reduce the processor counts to something smaller than the core count, I get really low throughput.

I just re-configured one of our servers with what I think is your advice:

config a

# higher batch size for less merges
output_batch_size = 4000

# processor counts sum up to the number of cores
inputbuffer_processors = 2
processbuffer_processors = 6
outputbuffer_processors =  8

This gives a throughput of ~8k messages per second. However, monitoring shows that we are neither IO bound nor CPU bound. It gets better if I increase the ‘elasticsearch_max_total_connections_per_route’ parameter, but the rate becomes more and more volatile, fluctuating between 5k and 30k.

config b

output_batch_size = 5000
inputbuffer_processors = 2
processbuffer_processors = 6
outputbuffer_processors =  8
outputbuffer_processor_threads_max_pool_size = 16
elasticsearch_max_total_connections = 32
elasticsearch_max_total_connections_per_route = 32

In addition to that I have the “crazy” config, which goes way beyond the standard values and seems to perform better.

config c

output_batch_size = 5000
processbuffer_processors = 72
outputbuffer_processors =  128
outputbuffer_processor_threads_max_pool_size = 128
elasticsearch_max_total_connections = 128
elasticsearch_max_total_connections_per_route = 128

However, only when I reduce the batch size do I get good throughput.

config d

output_batch_size = 100
processbuffer_processors = 72
outputbuffer_processors =  128
outputbuffer_processor_threads_max_pool_size = 128
elasticsearch_max_total_connections = 128
elasticsearch_max_total_connections_per_route = 128

I can only upload one screenshot to the forum, so in the screenshot you can see:

  • config a: 16:20 - 16:28
  • config b: 16:30 - 16:55
  • config c: 16:55 - 16:58
  • config d: 16:59 - the end

This is the cloud VM with 16 vCPUs and SSDs.

Hej @philiphauber

From my side, the answers end here. Maybe someone from the community will provide their findings.

If you want more tuning help or insights, I need to point you to the professional support that Graylog provides, which is also included in Graylog Enterprise.
