Performance degradation after upgrade from 6.0.7 to 6.1?

1. Describe your incident:
Last week I upgraded from 5.2.x to 6.0.7 and everything kept working fine as before. Now Graylog offered another update, so I upgraded to 6.1.

Then the problems started. Before, I had an output of well over 15,000 logs/s, but now I only get around 5,000-8,000 logs/s, which is far too low, and my journals are piling up.
I changed nothing else; the config file is the same as before.

2. Describe your environment:

  • OS Information:
    Rocky Linux release 9.4 (Blue Onyx)

I’m working with a cluster of 4 Graylog nodes and 4 OpenSearch nodes.

  • Package Version:
    Graylog Open (graylog-server) 6.1
  • Service logs, configurations, and environment variables:

Some previously configured values in server.conf (these were working fine before):
processbuffer_processors = 14
outputbuffer_processors = 10
output_batch_size = 3000

3. What steps have you already taken to try and solve the problem?
I changed some values in the config file (see above), since they are said to have gotten new defaults during the upgrade, but it made no difference.
Neither removing the entries nor changing them to different values (like setting the batch size to 30 mb) helped.
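
For reference, these are roughly the variants I tried (the byte-size form is how I understood the new 6.1 handling of output_batch_size; none of them changed anything):

# original value (message count), as above
output_batch_size = 3000

# tried the new byte-size style instead
output_batch_size = 30mb

# also tried removing my own processor settings so 6.1 calculates its own defaults
#processbuffer_processors = 14
#outputbuffer_processors = 10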

4. How can the community help?
Has anyone seen the same effect after upgrading to 6.1?
Any ideas where to look for the cause?
Maybe something has changed in general so that fewer resources are now available?

General question: where can I see how many *buffer_processors are actually running?

Thanks a lot for every hint!

A small update:

After downgrading to 6.0.7, the performance seems to be back to what it was before.
So it looks like there is indeed some difference between 6.0.x and 6.1 that affects either sending the data to OpenSearch or the processing of the logs. I’m not sure yet where the bottleneck actually is; the only data point I have is the number of logs output per second.

Hi NicoS, thanks for reporting this!

May I ask if you have CPU utilization metrics for the Graylog nodes to compare before/after the 6.1 upgrade? We’d be interested to know whether the CPU utilization on your Graylog nodes increased, decreased or was unchanged.

I changed some values in the config file (see above), since they are said to have gotten new defaults during the upgrade

Do I understand correctly that the server.conf configuration was changed or overwritten during the install? (Long shot, but would it be possible to provide a sanitized before/after of this file if it was changed?)

May I ask what the underlying spec of the Graylog nodes looks like in this case?

In the meantime, we’ll perform some comparative testing of 6.0.x and 6.1 to see if we can reproduce.

Hello,
After upgrading from 6.0.7 to 6.1.1, searches are also taking very, very long for me.
I use a single-node VM and changed nothing but the Graylog update.
How can I downgrade to 6.0.7?

Hi Tully,

thanks for your reply! I’ll try to provide the requested details:

  1. Here are the CPU graphs of my OpenSearch nodes and the Graylog nodes for the week before the upgrade, up to and including the day the upgrade was performed. However, as I downgraded soon after realizing the issue, there is no significant change visible. The slight increase might also be a result of the workers trying to work down the filled journal.


    So from my point of view there was no significant change in CPU utilization.

  2. No, the upgrade itself did NOT change the config files. They stayed the same before and after the upgrade.
    However, I saw in the release notes that there was a change in handling the option “output_batch_size” and that the default values for “processbuffer/outputbuffer_processors” are now calculated automatically. So I figured that might be related to my issue and tried to remove my own settings or change them to other values. But none of the changes had an effect on the issue.

  3. Here are my specs, 8 virtual systems on an ESX server:
    3x Graylog with 8 cores @ 2.2 GHz, 24 GB RAM
    1x Graylog with 12 cores @ 2.2 GHz, 24 GB RAM (more cores for testing purposes)
    4x OpenSearch with 8 cores @ 2.2 GHz, 32 GB RAM

Thanks a lot!

Br
Nico

If you have explicitly configured values in your server.conf, these will override any automatic or predefined values:

processbuffer_processors = 14
outputbuffer_processors = 10
output_batch_size = 3000

So those should still be applying. Are you exporting metrics using the Graylog Prometheus exporter? I ask because I'm curious what your buffers look like, specifically the process and output buffers on your Graylog nodes. If you are not exporting these, I recommend reading through Getting-Started-with-Metrics/readme.md at main · drewmiranda-gl/Getting-Started-with-Metrics · GitHub, which covers both configuring the exporters (you can ignore the parts you are already familiar with, such as node_exporter) and links to Grafana dashboards you can use out of the box.
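
If the exporter isn't set up yet, the rough shape is something like this (just a sketch; the bind address and scrape targets are placeholders to adapt to your environment):

In server.conf on each Graylog node:

prometheus_exporter_enabled = true
prometheus_exporter_bind_address = 0.0.0.0:9833

And a minimal scrape job in prometheus.yml:

scrape_configs:
  - job_name: 'graylog'
    static_configs:
      - targets: ['graylog01.mydomain.com:9833', 'graylog02.mydomain.com:9833', 'graylog03.mydomain.com:9833', 'graylog04.mydomain.com:9833']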

Is it safe to assume that the change in behavior on the chart is when you upgraded from 6.0 to 6.1 (Oct 21)? It does look like your CPUs were running somewhat close to capacity, though averaging about 75%, which I think is OK.

Are you using Elasticsearch as your indexer, and if so, which version?

Anecdotally, albeit in a smaller environment, I'm not observing any performance differences between 6.0 and 6.1, either in message ingestion or in search. I'm not saying there are no impacting changes, just that there may be variables unique to your cluster.

@NicoS

Thanks for the metrics, but as you say, there's not much visibility here.

Any possibility you could upload a sanitized copy of your server.conf so we can seek to replicate?

Thanks Drew,
Thanks Tully,

Drew, thanks for your clarification about the config values. So it looks like the *buffer_processors values are not the culprits (I already suspected that, since changing them made no difference in the behaviour).

We use OpenSearch 2.6.0 as the indexer.
(Don’t be confused: in the beginning we used Elasticsearch; the hostnames are still the same after switching to OpenSearch.)

This is our server.conf (without comments and anonymized). All 4 nodes are pretty much identical (only node 4 has a few more process buffer processors, as it has more cores). Everything else is left at the defaults.

is_leader = true
node_id_file = /etc/graylog/server/node-id
password_secret = *******************
root_password_sha2 = *******************
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address: 192.168.1.1:9000
http_external_uri = http://graylog.mydomain.com
stream_aware_field_types=false
elasticsearch_hosts = http://192.168.1.10:9200,http://192.168.1.11:9200,http://192.168.1.12:9200,http://192.168.1.13:9200
elasticsearch_connect_timeout = 5s
elasticsearch_idle_timeout = 60s
elasticsearch_max_total_connections = 512
elasticsearch_max_total_connections_per_route = 128
rotation_strategy = size
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 3000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 12
outputbuffer_processors = 8
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 10gb
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog01.mydomain.com:27017,graylog02.mydomain.com:27017,graylog04.mydomain.com:27017,graylog03.mydomain.com:27017/graylog?replicaSet=rs0
mongodb_max_connections = 1000
http_proxy_uri = http://proxy.mydomain:3128
http_non_proxy_hosts = localhost,127.0.0.1,192.168.1.*,*.mydomain.com
prometheus_exporter_enabled = true
prometheus_exporter_bind_address = 192.168.1.1:9833

Our output buffer values over the last two weeks:

And the process buffers:

Unfortunately the timespan where we had the issues is very short, as I downgraded immediately to fix it, so again there is not much visible.
Let me know if you want a more zoomed-in graph.

Drew, you are right, our machines are under pretty high load; I tried to utilize them as much as possible. However, in the weeks before the update we had no problems and all logs could be written in time (no journal filling up).

Hi NicoS

Looking at your configuration, I have a suspicion that your buffer processor values might be set too high on the Graylog side. Having too many threads can actually reduce throughput: with 8 cores, 22 threads across input/process/output buffers strikes me as high enough to push Graylog to near-100% CPU usage when it receives a burst of traffic or has a full journal to clear, which can result in a lot of cycles lost to contention and waits. You may actually benefit from stepping these down; perhaps try a total closer to 16 threads.
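
Purely as an illustration (not a tested recommendation, just one possible ~16-thread split on an 8-core node):

inputbuffer_processors = 2
processbuffer_processors = 9
outputbuffer_processors = 5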

A concept well illustrated by this grainy photograph of an e-book:

I’d also suggest it might be worth using a larger journal size if you can (not related to this issue, but it’s good practice to have 1+ day of typical ingest in total journal size across your nodes, if you have the storage).
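
As a very rough back-of-the-envelope sketch, assuming (purely for illustration) an average of ~500 bytes per journaled message at your ~15,000 msg/s rate:

15,000 msg/s × 86,400 s ≈ 1.3 billion messages per day
1.3 billion × ~500 bytes ≈ 650 GB per day in total
650 GB ÷ 4 nodes ≈ 160 GB of journal per node (versus the current message_journal_max_size = 10gb)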

Unfortunately there is not much we can deduce from these metric screenshots, except that the constraint first becomes visible at the point of outputting to OS.
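
If you want to watch that spot directly on a node, the REST API exposes the output buffer utilization metric; I'm going from memory on the exact metric name, so treat this as a sketch (credentials and address are placeholders):

curl -s -u admin:yourpassword -H 'X-Requested-By: cli' 'http://192.168.1.1:9000/api/system/metrics/org.graylog2.buffers.output.usage'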

May I ask some questions about the OpenSearch side?

I see the OS nodes have 32 GB RAM; is ~16 GB of heap memory assigned?
May I ask what indexing strategy you use, and with what configuration for shard count/replicas?
May I ask the total shard count on your OS cluster at present (visible on System/Overview)?
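
If it helps, a couple of quick ways to check these from any OpenSearch node (a sketch; the node address is taken from your elasticsearch_hosts and may need credentials/TLS in your setup):

# configured heap and current heap usage per node
curl -s 'http://192.168.1.10:9200/_cat/nodes?v&h=name,heap.max,heap.percent'

# shard counts for the whole cluster
curl -s 'http://192.168.1.10:9200/_cluster/health?pretty' | grep shards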

And just to note that we’ve got some internal performance tests running on 6.0.x vs 6.1.x at the moment; we’ll report back here when they’re complete.

Best
Tully

To come back to this: we were unable to replicate a performance difference between 6.0.x and 6.1.x under lab conditions, performing the same stress test on both.

Seen here, 6.0.x on the left, 6.1.x on the right.

It's speculation, but upon starting 6.1.x you may just have been observing the effect of a filled journal being emptied, plus delayed OS maintenance actions like index optimization and rotation kicking in when the app started up, causing a temporary drop in OS performance.