High CPU usage after upgrade from Graylog 4.1 + Java 8 to Graylog 4.2 + Java 17

Hello everyone,

I recently upgraded a 3-node Graylog 4.1.14 cluster running on Java 8 (Temurin) to Graylog 4.2.12 on Java 17 (again, Temurin). The deployment uses the Debian packages on Debian 10 (buster). No other OS updates were made.

I am observing a very large change in CPU usage for graylog-server after the update. Before, even at peak traffic, CPU usage was low (no more than 15% of any single CPU, evenly distributed across all CPUs); now it reaches around 50% of every CPU, again evenly distributed.

The following is a graph of messages (as reported by the built-in Prometheus exporter) vs CPU usage (topping out at 400%, since it is the sum of CPU time across 4 CPUs), clearly showing that the behaviour after this morning’s upgrade is very different from yesterday’s.

Other performance indicators (buffers, journal) are not under any kind of stress.

This is very similar to what happened to me during a previous upgrade (4.1.1 to 4.1.9 on Java 11). After reviewing that situation with the help of @gsmith in this thread, I ascribed it to Java 11 not being the intended Java version for Graylog 4.1, since rolling back to Java 8 restored the “normal” performance. The previous thread also contains further information on the cluster.

The configuration of the nodes has not changed, and is largely the default from the Debian package, except for raising default UDP buffer sizes.

Current configuration:
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = *** 
root_password_sha2 = ***
root_email = "root@example.com"
root_timezone = Europe/Rome
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 10.0.0.7:9000
trusted_proxies = 127.0.0.1/32, 10.0.0.5/32, 10.0.0.6/32, 10.0.0.7/32
elasticsearch_hosts = http://10.0.0.2:9200,http://10.0.0.3:9200,http://10.0.0.4:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
udp_recvbuffer_sizes = 4194304
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog:***@10.0.0.5:27017,10.0.0.6:27017,10.0.0.7:27017/graylog?replicaSet=graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = 127.0.0.1
transport_email_port = 25
transport_email_use_auth = false
transport_email_subject_prefix = [graylog]
transport_email_from_email = graylog@example.com
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_web_interface_url = https://graylog.example.com/
proxied_requests_thread_pool_size = 32
prometheus_exporter_enabled = true
prometheus_exporter_bind_address = 10.0.0.7:9833

I am quite puzzled by this recurring situation; it makes me think that either I got something very wrong or I am missing some very subtle reason for the behaviour. A brief search in the forums does not yield much (I already did one on the last occurrence), apart from a similar thread where @molnar_istvan was able to troubleshoot a high CPU usage issue by changing the values of processor_wait_strategy and inputbuffer_wait_strategy.

FYI, I tried changing them from blocking to sleeping and to yielding, but that seemingly made CPU usage worse (CPU was constantly at 100% utilization). Apart from that, the server I tried this on seemed to process messages just fine.
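
For reference, the change I tested on that node was just these two lines in server.conf (shown here for sleeping; I tried yielding the same way, and reverted both afterwards):

# /etc/graylog/server/server.conf, test only, reverted afterwards
processor_wait_strategy = sleeping
inputbuffer_wait_strategy = sleeping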

Has Graylog 4.2 given any of you the same kind of issue? I confess that at the moment my experience seems to point to some change, probably introduced between 4.1.1 and 4.1.9 and carried over to 4.2, that degrades performance on Java 11 and later. But I understand this is purely circumstantial evidence, I lack the expertise to substantiate it by looking at the code, and I would expect more people to report it if that were the case. Would it make sense, and more importantly is it supported, to roll back to Java 8 or Java 11 to verify whether the Java version really does play an important role, in which case it might be better to seek help from the devs?


Whenever I see something like this, it makes me think of regex/GROK and how easy it is for them to get out of control. Not always the answer though… still, it may be worth looking into how efficient your regex/GROK patterns are… I have posted articles before about runaway regex; here is a new one I found that talks about efficiency:

https://docs.bmc.com/docs/discovery/113/improving-the-performance-of-regular-expressions-788111995.html
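
As a quick illustration of what the article means by runaway patterns (a generic example, not taken from your pipelines):

# Prone to catastrophic backtracking on long lines that do NOT match,
# because (\w+\s*)* can split the same run of characters in exponentially many ways:
^(\w+\s*)*failed$

# Roughly the same intent, but unambiguous: a non-matching line is rejected in linear time:
^[\w\s]*failed$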

I don’t have enough experience with Java rollbacks to give an opinion there… Graylog staff occasionally look through here (@dscryber); perhaps they have some thoughts…

Hi and thanks. I definitely make use of GROK and regular expressions. I tend to follow a KISS approach (i.e. mostly matching leading-anchored strings), but I will take another look at my pipelines and at your article to see whether I may be off somewhere.

What puzzles me, however, is that this is not a new installation. The GROK patterns, regular expressions, and other configuration have not changed; only the versions of graylog-server and Java did. So my reasoning is that if GROK or regular expressions are to blame, then something in one of

  • The “engine/library” that graylog-server uses for GROK or regular expressions (if it has been updated at all)
  • How graylog-server itself handles GROK or regular expressions (if something changed in the code)
  • The way the JVM runs graylog-server

is a natural suspect (i.e. not the expressions or patterns themselves).

Out of curiosity, is there some way, e.g. using metrics from graylog-server, JMX, or some other method, to see whether graylog-server really spends most of its time processing GROK or regular expressions, and maybe drill down to a specific rule or pipeline?
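
One generic way I could check this myself, assuming the Temurin JDK on the nodes ships jstack, is to map the hottest JVM threads to their stack traces and see whether they sit in regex/GROK code (the thread ID 12345 below is just an example):

# show per-thread CPU usage of the graylog-server JVM
top -H -p "$(pidof java)"

# convert the ID of a hot thread (e.g. 12345) to hex, as used in thread dumps
printf '0x%x\n' 12345

# take a thread dump and look at that thread's stack
jstack "$(pidof java)" | grep -A 30 'nid=0x3039'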

In addition to the community post, I think this issue deserves to be filed as a bug report on GitHub, in order to (hopefully) get the attention it needs from the devs.


Hey @gian

Sorry to hear you’re having issues. I do have a couple of questions for ya, if you don’t mind.

Is CPU usage still at 200% all the time or only during this instance?

The reason I’m asking is that if it’s all the time, I would definitely look into @tmacgbay’s and @H2Cyber’s statements. If not, then I would question when your indices are rotated and the setting configured for “Field type refresh interval”.

Java:
I have rolled back Java versions, changed Java versions, etc…
The only kinds of issues I had were having to reconfigure my custom Java settings after using the wrong version of Java, and, since I use the default Java keystore “cacerts”, having to adjust my certificates because its password had changed. The awesome thing about this is that if things fail I just put the settings back and restart the service(s).
So basically I’m not deleting and re-downloading the different versions I need. This is how I lab issues.

sudo update-alternatives --config java


Hi @gsmith!

The CPU usage follows the message rate pretty closely; it is not constant. Here is an updated view of the last 48 hours.

If you are wondering, the excursion on graylog3 yesterday morning was me fiddling with the *_wait_strategy settings. The profile also followed the message rate before the update, but, as said, usage was approximately 4x lower (curiously, a similar amount to what I saw in my 4.1.11 + Java 11 thread). If I understand correctly, you are suggesting that since the high CPU usage is not constant and independent of the message rate, the *_wait_strategy settings are expected to play no role in it.

Regarding rolling back Java, as said, I can try. I usually do not rely on the global update-alternatives setting for Java, but explicitly configure the JAVA environment variable in /etc/default/graylog-server. As for the JVM settings, I am currently using the package defaults from /etc/default/graylog-server:

# Default Java options for heap and garbage collection.
GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"

# Avoid endless loop with some TLSv1.3 implementations.
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Djdk.tls.acknowledgeCloseNotify=true"

# Fix for log4j CVE-2021-44228
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Dlog4j2.formatMsgNoLookups=true"

which results e.g. in

gian@graylog1:~$ cat /proc/$(pidof java)/cmdline | tr '\0' '\n'
/usr/lib/jvm/temurin-17-jdk-amd64/bin/java
-Xms4g
-Xmx4g
-XX:NewRatio=1
-server
-XX:+ResizeTLAB
-XX:-OmitStackTraceInFastThrow
-Djdk.tls.acknowledgeCloseNotify=true
-Dlog4j2.formatMsgNoLookups=true
-jar
-Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml
-Djava.library.path=/usr/share/graylog-server/lib/sigar
-Dgraylog2.installation_source=deb
/usr/share/graylog-server/graylog.jar
server
-f
/etc/graylog/server/server.conf
-np

Do you see any setting that needs tweaking? I do not need to change anything in Java’s keystore, since the service is on a private network and I am not terminating TLS directly on the graylog-server nodes.
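
If it helps, I could also enable GC logging on one node to rule out the garbage collector (the default collector changed between Java 8 and later releases). With the unified logging syntax of Java 9+, that would be something along these lines (log path and sizes are just my guess):

# appended to GRAYLOG_SERVER_JAVA_OPTS on one node only, for diagnostics
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Xlog:gc*:file=/var/log/graylog-server/gc.log:time,uptime:filecount=5,filesize=10m"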

JFYI, I am testing a rollback to Java 8 on one of the graylog-server nodes. So far my observations indicate that performance goes back to previous levels after the Java downgrade. I felt comfortable doing this because all official Graylog Docker images seem to run Java 8 by default.
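
For clarity, the rollback on the test node is just the per-service pin plus a restart (the Temurin 8 path is my assumption, mirroring the Temurin 17 path shown above):

# /etc/default/graylog-server on the test node
JAVA=/usr/lib/jvm/temurin-8-jdk-amd64/bin/java

# then
sudo systemctl restart graylog-server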

I do not want to jinx it, though; I would like to gather some more data, e.g. during tomorrow’s morning peak, to provide clearer evidence for all to see.

Hello,

I personally don’t see anything unusual. Looks standard to me.

More or less, I was looking into the Field type refresh interval and the Index rotation strategy. Both of these are normally time-based settings.

Sounds good, keep us informed. I personally would like to know for future troubleshooting.


Sorry, I completely missed that question. I am only working with the default index set, and its settings are:

  • Index prefix: graylog
  • Shards: 3
  • Replicas: 1
  • Field type refresh interval: 5 seconds
  • Index rotation strategy: Index Time
  • Rotation period: P1D (1 day, a day)
  • Index retention strategy: Delete
  • Max number of indices: 180

Regarding performance, here is the situation today

where the difference between the node running on Java 8 and the other two is clear as day. Given these results, I will roll back all nodes to Java 8.

At least from a practical standpoint, then, I can only observe that Graylog 4.2.12 performs much worse on Java 17 than on Java 8, even though both versions seem to run without errors.

I feel this is a very curious situation that would warrant more attention. If any of you can do more than I did to attract the attention of Graylog’s developers, I would greatly appreciate it. I am particularly puzzled that no one else reading these forums seems to have noticed this (is my installation that peculiar?), that there have been few comments on these threads, and that the installation guide pushes users towards Java 17. I would also like to understand why, if that is the suggested version, the Docker images still stick to Java 8 (though I see there are Java 11 variants), and whether that might be because someone has already observed some of this.

There is only so much I can do with my setups. An independent review would be really good, though I understand not many people have the time and interest to spend on it. I do not know if @H2Cyber’s suggestion still stands, but I opened the following issue on GitHub

