Graylog 6.1.5 performance


1. Describe your incident:
After upgrading from Graylog 5.x to 6.1.5, there has been a degradation in output performance.


2. Describe your environment:

  • OS Information: Red Hat Enterprise Linux 8 (x86_64)

  • Package Version: Graylog 6.1.5, Elasticsearch OSS 7.10.2

  • Service logs, configurations, and environment variables:

elasticsearch_connect_timeout = 10s
elasticsearch_socket_timeout = 60s
elasticsearch_max_total_connections = 1024
elasticsearch_max_total_connections_per_route = 16
elasticsearch_max_retries = 2

rotation_strategy = count
retention_strategy = delete

allow_leading_wildcard_searches = false
allow_highlighting = false

output_batch_size = 100
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

processbuffer_processors = 128
outputbuffer_processors = 64
outputbuffer_processor_threads_max_pool_size = 64

udp_recvbuffer_sizes = 16777216

processor_wait_strategy = blocking
ring_size = 65536

inputbuffer_ring_size = 65536
inputbuffer_processors = 6
inputbuffer_wait_strategy = blocking

message_journal_enabled = true
message_journal_dir = /home1/graylog-server/journal
message_journal_max_size = 360gb
message_journal_max_age = 1h

lb_recognition_period_seconds = 3
lb_throttle_threshold_percentage = 90

3. What steps have you already taken to try and solve the problem?
The tuning configuration above had been working without issues, and no changes were made other than the Graylog version upgrade.

4. How can the community help?

We are currently processing an input rate of 1.5 to 2 million messages per second.

Our setup includes:

  • Graylog: 100 physical machines, each with 48 cores.
  • Elasticsearch (ES) Pool: A cluster of approximately 200 nodes.

Since our log input is expected to increase further, we need to scale up the output throughput. What would be the best approach to achieve this?

I experienced similar problems when upgrading from 6.0.7 to 6.1.x.

I tried to discuss it here:
https://community.graylog.org/t/after-upgrade-from-6-0-7-to-6-1-performance-degrade/33908/16

However, there was no solution…

@NicoS thank you for your reply.

Are you currently running the system after downgrading to version 6.0?

Yes, after noticing the performance drop, I had to go back to 6.0.7 (or 6.0.8 now, in fact).
On my side the drop was significant (more than yours): we could not output as much data as we ingested… (to be honest, we are working close to the limit anyway…)
Yesterday I upgraded my machines with more CPUs, so I plan to make another upgrade attempt soon, hoping we now have enough resources to keep up with the intake.

We are in a situation where it is difficult to downgrade, so we are continuously testing to improve output performance.

The issue we have identified so far is that the load on OpenSearch differs significantly between querying OpenSearch directly and searching via the Graylog dashboard.

Queries that respond within 5 seconds in OpenSearch Dashboards take 30–60 seconds or more in Graylog. We understand that widgets add some overhead, but the excessive load this puts on OpenSearch is hard to explain.

In addition, output performance deteriorates further while search queries are running.
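
For reference, one way to compare the two paths more directly would be to enable the OpenSearch search slow log on the Graylog indices, so the queries Graylog actually submits (and their timings) show up in the OpenSearch logs. This is only a sketch, assuming the default graylog_* index prefix and an endpoint without authentication:

import requests

# Sketch: turn on the search slow log for the Graylog index set so that the
# queries Graylog sends to OpenSearch are logged together with their runtimes.
# OS_URL and the missing authentication are placeholders for this example.
OS_URL = "http://opensearch-node:9200"

slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.query.info": "2s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
}

resp = requests.put(f"{OS_URL}/graylog_*/_settings", json=slowlog_settings, timeout=10)
resp.raise_for_status()
print(resp.json())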

Hey @yokim,

that is surprising to hear. Do you also see the long search runtime when you start a search with the message count widget removed, keeping just the message table? Do you gather metrics for your OpenSearch cluster that could help us understand what is happening there?
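
In case you are not collecting them yet, here is a minimal sketch of the kind of numbers that would be interesting while the slow searches and the slow output are happening; the endpoint URL and the missing authentication are placeholders, so adjust them to your cluster:

import requests

OS_URL = "http://opensearch-node:9200"

# Per-node write thread pool: active threads, queue length, rejected requests.
pools = requests.get(
    f"{OS_URL}/_cat/thread_pool/write",
    params={"h": "node_name,active,queue,rejected", "format": "json"},
    timeout=10,
).json()
for pool in pools:
    print(pool["node_name"], pool["active"], pool["queue"], pool["rejected"])

# Per-node indexing stats: totals, time spent indexing, and throttling.
stats = requests.get(f"{OS_URL}/_nodes/stats/indices/indexing", timeout=10).json()
for node in stats["nodes"].values():
    indexing = node["indices"]["indexing"]
    print(node["name"], indexing["index_total"],
          indexing["index_time_in_millis"], indexing["throttle_time_in_millis"])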

Hi @yokim,

did you observe notable differences in the metrics of your Elasticsearch cluster after the Graylog upgrade, and could you share those?

Also, are you running any additional outputs?

Regarding your configuration, I’m curious why you’ve set output_batch_size = 100, which is very low.

Also, the process buffer and output buffer related settings are rather high.

If you’d like to experiment to see whether this improves performance in your setup, you could reconfigure a single node to use the defaults by removing the following two settings:

outputbuffer_processors = 64
outputbuffer_processor_threads_max_pool_size = 64

which will let Graylog determine the number of processors based on the number of available CPU cores.

You could then replace

output_batch_size = 100

with

output_batch_size = 5mb

which should lead to larger batches of messages (unless your individual messages are really large).

If that already improves the performance of the output path, you could also try removing the following setting:

processbuffer_processors = 128

to let Graylog determine the number of processors based on the available CPU cores. Graylog logs the chosen number of processors to the server log during startup.
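
Putting those suggestions together, the changed part of the experimental node’s server.conf would roughly look like this (everything else stays as it is now; treat it as a sketch for a single test node, not a recommendation for the whole cluster):

output_batch_size = 5mb
# outputbuffer_processors = 64 (removed, let Graylog compute the default)
# outputbuffer_processor_threads_max_pool_size = 64 (removed)
# processbuffer_processors = 128 (remove in a second step if the output path improves)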

Hi @yokim,

I noticed that you reported an Elasticsearch version above, but wrote in a later comment that you are using OpenSearch (and OpenSearch Dashboards).

Please let us know which search backend you are using. If you are using OpenSearch, the exact version would be good to know.

Thanks!

@yokim Another question: Did you notice anything suspicious in your Graylog server logs regarding the output path, like more indexing errors or retries?
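
If it helps, a quick way to check this without grepping server.log on a hundred machines is Graylog’s indexer failures endpoint; the URL, credentials, and field names below are placeholders, so treat this as a sketch only:

import requests

# Sketch: list recent indexing failures via the Graylog REST API.
# An access token can also be used as the username with "token" as the password.
GRAYLOG_API = "http://graylog-node:9000/api"
AUTH = ("admin", "password")

resp = requests.get(
    f"{GRAYLOG_API}/system/indexer/failures",
    params={"limit": 20, "offset": 0},
    auth=AUTH,
    headers={"Accept": "application/json"},
    timeout=10,
)
resp.raise_for_status()
data = resp.json()
print("total indexing failures:", data.get("total"))
for failure in data.get("failures", []):
    print(failure.get("timestamp"), failure.get("index"), failure.get("message"))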