Hello everyone,
I recently upgraded a 3-node Graylog 4.1.14 cluster running under Java 8 (Temurin) to Graylog 4.2.12 under Java 17 (again Temurin). The deployment uses Debian packages on Debian 10 buster; no other updates were made to the OS.
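For reference, the upgrade on each node boiled down to something like the following; the repository package name is what I remember from the Graylog upgrade notes and the Temurin package comes from the Adoptium apt repository, so both may differ slightly:

# switch the node to the Graylog 4.2 apt repository and upgrade the package
wget https://packages.graylog2.org/repo/packages/graylog-4.2-repository_latest.deb
dpkg -i graylog-4.2-repository_latest.deb
apt-get update
apt-get install graylog-server

# install the Temurin 17 runtime (Adoptium repository already configured) and restart
apt-get install temurin-17-jdk
systemctl restart graylog-server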
Since the update I am seeing a very large increase in CPU usage for graylog-server. Before, even under peak traffic, CPU usage was very low (no more than 15% of any single CPU, spread evenly across all CPUs); now it reaches around 50% of every CPU, again spread evenly across all of them.
The following graph plots messages (as reported by the built-in Prometheus exporter) against CPU usage (topping out at 400%, since it is the sum of CPU time across 4 CPUs), and clearly shows how the behaviour after this morning's update differs from yesterday's.
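For the record, the message and CPU series come from scraping each node's exporter endpoint, roughly as below; the metric name filter is only a guess at what to look for, so the exact names may differ:

# peek at the CPU- and throughput-related metrics exposed by the built-in exporter
# (10.0.0.7:9833 is the prometheus_exporter_bind_address configured below)
curl -s http://10.0.0.7:9833/metrics | grep -Ei 'cpu|input|output|journal' | head -n 40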
Other performance indicators (buffers, journal) are not under any kind of stress.
This is very similar to what happened to me during a previous upgrade (4.1.1 to 4.1.9 on Java 11). After reviewing that situation with the help of @gsmith in this thread, I ascribed it to Java 11 not being the intended Java version for Graylog 4.1, since rolling back to Java 8 restored the “normal” performance. That thread also contains further information on the cluster.
The configuration of the nodes has not changed, and is largely the default from the Debian package, except for raising default UDP buffer sizes.
Current configuration:
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = ***
root_password_sha2 = ***
root_email = "root@example.com"
root_timezone = Europe/Rome
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 10.0.0.7:9000
trusted_proxies = 127.0.0.1/32, 10.0.0.5/32, 10.0.0.6/32, 10.0.0.7/32
elasticsearch_hosts = http://10.0.0.2:9200,http://10.0.0.3:9200,http://10.0.0.4:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
udp_recvbuffer_sizes = 4194304
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog:***@10.0.0.5:27017,10.0.0.6:27017,10.0.0.7:27017/graylog?replicaSet=graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = 127.0.0.1
transport_email_port = 25
transport_email_use_auth = false
transport_email_subject_prefix = [graylog]
transport_email_from_email = graylog@example.com
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_web_interface_url = https://graylog.example.com/
proxied_requests_thread_pool_size = 32
prometheus_exporter_enabled = true
prometheus_exporter_bind_address = 10.0.0.7:9833
I am quite puzzled by this recurring situation, and it makes me think that either I got something very wrong or I am missing some very subtle reason for the behaviour. A brief search of the forums (which I had already done on the last occurrence) does not yield much, apart from a similar thread where @molnar_istvan was able to troubleshoot a high CPU usage issue by changing the values of processor_wait_strategy and inputbuffer_wait_strategy.
FYI, I tried changing them from blocking to sleeping and to yielding, but it seemingly made CPU usage worse (the CPU was constantly at 100% utilization). Apart from this, the server I tried it on seemed to be processing messages just fine.
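For reference, the combinations I tried in server.conf looked roughly like this, one attempt at a time with a graylog-server restart in between (if I read the sample configuration correctly, the accepted values are yielding, sleeping, blocking and busy_spinning):

# first attempt
processor_wait_strategy = sleeping
inputbuffer_wait_strategy = sleeping

# second attempt
processor_wait_strategy = yielding
inputbuffer_wait_strategy = yielding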
Has Graylog 4.2 given the same kind of issue to any of you? I confess that at the moment my experience seems to point to some change, probably introduced between 4.1.1 and 4.1.9 and carried over to 4.2, that degrades performance on Java 11 and later. I understand this is purely circumstantial evidence, I lack the expertise to substantiate it by looking at the code, and I would expect more people to report it if it were the case. Would it make sense, and more importantly is it supported, to roll back to Java 8 or Java 11 to verify whether the Java version really does play an important role, in which case it may be better to seek help from the devs?
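If such a rollback is a sensible test, my plan would be roughly the following; it assumes the Debian package still reads the JAVA variable from /etc/default/graylog-server and that the Temurin 11 package installs under /usr/lib/jvm/temurin-11-jdk-amd64, both of which I would double-check first:

# install a Temurin 11 runtime alongside 17 (Adoptium apt repository)
apt-get install temurin-11-jdk

# point graylog-server at it explicitly instead of the system default java
sed -i 's|^JAVA=.*|JAVA=/usr/lib/jvm/temurin-11-jdk-amd64/bin/java|' /etc/default/graylog-server
systemctl restart graylog-server

# confirm which JVM the running graylog-server process is actually using
ps -C java -o pid,args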