I recently upgraded a Graylog 4.1.14 3-node cluster running under Java 8 (Temurin) to Graylog 4.2.12 under Java 17 (again, Temurin). The deployment is done with Debian packages on Debian 10 buster. No other update on the OS has been made.
I am observing a large increase in CPU usage for graylog-server after the update. Before, even under the highest traffic, CPU usage was very low (no higher than 15% on any single CPU, evenly distributed across all CPUs). Now it reaches around 50% on each CPU, again evenly distributed across all CPUs.
The following is a graph of messages (as reported by the built-in prometheus exporter) vs CPU usage (tops at 400% since it is the sum of CPU time across 4 CPUs), clearly showing how this morning with the update the behaviour is very different from yesterday.
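To relate the per-CPU figures above to the summed scale of the graph, here is a minimal sketch of the conversion. The function names are mine, purely for illustration, and it assumes the load is spread evenly across the 4 CPUs, as I observed.

```python
def summed_from_per_cpu(per_cpu_percent: float, n_cpus: int) -> float:
    """Summed CPU percentage (scale 0..100*n_cpus) for an evenly spread per-CPU load."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return per_cpu_percent * n_cpus


def per_cpu_from_summed(summed_percent: float, n_cpus: int) -> float:
    """Average per-CPU percentage for a summed CPU percentage."""
    if n_cpus <= 0:
        raise ValueError("n_cpus must be positive")
    return summed_percent / n_cpus


# Hypothetical helper values matching the figures quoted in this post.
before = summed_from_per_cpu(15, 4)  # peak before the upgrade, on the 400% scale
after = summed_from_per_cpu(50, 4)   # peak after the upgrade, on the 400% scale
```

With these figures, the old peak corresponds to about 60% on the graph's 400% scale and the new one to about 200%.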
Other performance indicators (buffers, journal) are not under any kind of stress.
This is very similar to what happened to me in a previous upgrade (4.1.1 to 4.1.9 on Java 11), which after reviewing the situation with the help of @gsmith in this thread I ascribed to Java 11 not being the intended version of Java to use with Graylog 4.1, since rolling back to Java 8 restored the “normal” performance. The previous thread also contains further information on the cluster.
The configuration of the nodes has not changed, and is largely the default from the Debian package, except for raising default UDP buffer sizes.
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = ***
root_password_sha2 = ***
root_email = "email@example.com"
root_timezone = Europe/Rome
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 10.0.0.7:9000
trusted_proxies = 127.0.0.1/32, 10.0.0.5/32, 10.0.0.6/32, 10.0.0.7/32
elasticsearch_hosts = http://10.0.0.2:9200,http://10.0.0.3:9200,http://10.0.0.4:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
udp_recvbuffer_sizes = 4194304
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog:***@10.0.0.5:27017,10.0.0.6:27017,10.0.0.7:27017/graylog?replicaSet=graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = 127.0.0.1
transport_email_port = 25
transport_email_use_auth = false
transport_email_subject_prefix = [graylog]
transport_email_from_email = firstname.lastname@example.org
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_web_interface_url = https://graylog.example.com/
proxied_requests_thread_pool_size = 32
prometheus_exporter_enabled = true
prometheus_exporter_bind_address = 10.0.0.7:9833
I am quite puzzled by this recurring situation. It makes me think that either I got something very wrong or I am missing some very subtle reason for the behaviour. A brief search in the forums does not yield much (I already searched the last time this happened), apart from a similar thread where @molnar_istvan was able to troubleshoot a high CPU usage issue by changing the values for processor_wait_strategy and inputbuffer_wait_strategy.
FYI, I tried changing them from blocking to yielding, but it seemingly made the CPU usage worse (CPU was always at 100% utilization). Apart from this, the server I tried this on seemed to be processing messages just fine.
Have any of you run into the same kind of issue with Graylog 4.2? I confess that, at the moment, what I have experienced seems to point to some change, probably introduced between 4.1.1 and 4.1.9 and carried over to 4.2, that degrades performance on Java 11 and later. I understand that this is entirely circumstantial evidence, and I lack the expertise to substantiate it by looking at the code. I would also expect more people to have reported it if that were the case. Would it make sense, and more importantly is it supported, to roll back to Java 8 or Java 11 to verify whether the Java version does indeed play an important role? If it does, it may be better to seek help from the devs.