High CPU usage after upgrade from Graylog 4.1 + Java 8 to Graylog 4.2 + Java 17

Hello everyone,

I recently upgraded a 3-node Graylog 4.1.14 cluster running on Java 8 (Temurin) to Graylog 4.2.12 on Java 17 (again, Temurin). The deployment uses the Debian packages on Debian 10 (buster). No other OS updates were made.

I am observing a very large change in CPU usage for graylog-server after the update. Before, even at peak traffic, CPU usage was low (no more than 15% of any single CPU, evenly distributed across all CPUs); now it reaches around 50% of every CPU, again evenly distributed.

The following is a graph of messages (as reported by the built-in Prometheus exporter) vs CPU usage (topping out at 400%, since it is the sum of CPU time across 4 CPUs), clearly showing that the behaviour after this morning’s upgrade is very different from yesterday’s.

Other performance indicators (buffers, journal) are not under any kind of stress.

This is very similar to what happened to me during a previous upgrade (4.1.1 to 4.1.9 on Java 11). After reviewing that situation with the help of @gsmith in this thread, I ascribed it to Java 11 not being the intended Java version for Graylog 4.1, since rolling back to Java 8 restored the “normal” performance. The previous thread also contains further information on the cluster.

The configuration of the nodes has not changed, and is largely the default from the Debian package, except for raising default UDP buffer sizes.

Current configuration:
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = *** 
root_password_sha2 = ***
root_email = "root@example.com"
root_timezone = Europe/Rome
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 10.0.0.7:9000
trusted_proxies = 127.0.0.1/32, 10.0.0.5/32, 10.0.0.6/32, 10.0.0.7/32
elasticsearch_hosts = http://10.0.0.2:9200,http://10.0.0.3:9200,http://10.0.0.4:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
udp_recvbuffer_sizes = 4194304
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog:***@10.0.0.5:27017,10.0.0.6:27017,10.0.0.7:27017/graylog?replicaSet=graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = 127.0.0.1
transport_email_port = 25
transport_email_use_auth = false
transport_email_subject_prefix = [graylog]
transport_email_from_email = graylog@example.com
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_web_interface_url = https://graylog.example.com/
proxied_requests_thread_pool_size = 32
prometheus_exporter_enabled = true
prometheus_exporter_bind_address = 10.0.0.7:9833

I am quite puzzled by this recurring situation; it makes me think that either I got something very wrong or I am missing some very subtle reason for the behaviour. A brief search in the forums does not yield much (I already did one on the last occurrence), apart from a similar thread where @molnar_istvan was able to troubleshoot a high CPU usage issue by changing the values of processor_wait_strategy and inputbuffer_wait_strategy.

FYI, I tried changing them from blocking to sleeping and to yielding, but that seemingly made CPU usage worse (CPU was constantly at 100% utilization). Apart from that, the server I tried this on seemed to process messages just fine.
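
For reference, the change I tested on that node was just these two lines in server.conf (shown here for sleeping; I tried yielding the same way, and reverted both afterwards):

# /etc/graylog/server/server.conf, test only, reverted afterwards
processor_wait_strategy = sleeping
inputbuffer_wait_strategy = sleeping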

Has Graylog 4.2 given any of you the same kind of issue? I confess that at the moment my experience seems to point to some change, probably introduced between 4.1.1 and 4.1.9 and carried over to 4.2, that degrades performance on Java 11 and later. But I understand this is purely circumstantial evidence, I lack the expertise to substantiate it by looking at the code, and I would expect more people to report it if that were the case. Would it make sense, and more importantly is it supported, to roll back to Java 8 or Java 11 to verify whether the Java version really does play an important role, in which case it might be better to seek help from the devs?


Whenever I see something like this, it makes me think of regex/GROK and how easy it is for them to get out of control. Not always the answer though… still, it may be worth looking into how efficient your regex/GROK patterns are… I have posted articles before about runaway regex; here is a new one I found that talks about efficiency:

https://docs.bmc.com/docs/discovery/113/improving-the-performance-of-regular-expressions-788111995.html
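
As a quick illustration of what the article means by runaway patterns (a generic example, not taken from your pipelines):

# Prone to catastrophic backtracking on long lines that do NOT match,
# because (\w+\s*)* can split the same run of characters in exponentially many ways:
^(\w+\s*)*failed$

# Roughly the same intent, but unambiguous: a non-matching line is rejected in linear time:
^[\w\s]*failed$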

I don’t have enough experience with Java rollbacks to give an opinion there… Graylog staff occasionally look through here (@dscryber); perhaps they have some thoughts…

Hi and thanks. I definitely make use of GROK and regular expressions. I tend to follow a KISS approach (i.e. mostly matching leading-anchored strings), but I will take another look at my pipelines and at your article to see whether I may be off somewhere.

What puzzles me, however, is that this is not a new installation. The GROK patterns, regular expressions, and other configuration have not changed; only the versions of graylog-server and Java did. So my reasoning is that if GROK or regular expressions are to blame, then something in one of

  • The “engine/library” that graylog-server uses for GROK or regular expressions (if it has been updated at all)
  • How graylog-server itself handles GROK or regular expressions (if something changed in the code)
  • The way the JVM runs graylog-server

is a natural suspect (i.e. not the expressions or patterns themselves).

Out of curiosity, is there some way, e.g. using metrics from graylog-server, JMX, or some other method, to see whether graylog-server really spends most of its time processing GROK or regular expressions, and maybe drill down to a specific rule or pipeline?
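
One generic way I could check this myself, assuming the Temurin JDK on the nodes ships jstack, is to map the hottest JVM threads to their stack traces and see whether they sit in regex/GROK code (the thread ID 12345 below is just an example):

# show per-thread CPU usage of the graylog-server JVM
top -H -p "$(pidof java)"

# convert the ID of a hot thread (e.g. 12345) to hex, as used in thread dumps
printf '0x%x\n' 12345

# take a thread dump and look at that thread's stack
jstack "$(pidof java)" | grep -A 30 'nid=0x3039'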

In addition to the community post, I think this issue deserves to be filed as a bug report on GitHub, in order to (hopefully) get the attention it needs from the devs.


Hey @gian

Sorry to hear you’re having issues. I do have a couple of questions for ya, if you don’t mind.

Is CPU usage still at 200% all the time or only during this instance?

The reason I’m asking is that if it’s all the time, I would definitely look into @tmacgbay’s and @H2Cyber’s statements. If not, then I would question when your indices are rotated and the setting configured for “Field type refresh interval”.

Java:
I have rolled back Java versions, changed Java versions, etc…
The only kinds of issues I had were having to reconfigure my custom Java settings after using the wrong version of Java, and, since I use the default Java keystore “cacerts”, having to adjust my certificates because its password had changed. The awesome thing about this is that if things fail I just put the settings back and restart the service(s).
So basically I’m not deleting and re-downloading the different versions I need. This is how I lab issues.

sudo update-alternatives --config java


Hi @gsmith!

The CPU usage follows the message rate pretty closely; it is not constant. Here is an updated view of the last 48 hours.

If you are wondering, the excursion on graylog3 yesterday morning was me fiddling with the *_wait_strategy settings. The profile also followed the message rate before the update, but, as said, usage was approximately 4x lower (curiously, a similar amount to what I saw in my 4.1.11 + Java 11 thread). If I understand correctly, you are suggesting that since the high CPU usage is not constant and independent of the message rate, the *_wait_strategy settings are expected to play no role in it.

Regarding rolling back Java, as said, I can try. I usually do not rely on the global update-alternatives setting for Java, but explicitly configure the JAVA environment variable in /etc/default/graylog-server. As for the JVM settings, I am currently using the package defaults from /etc/default/graylog-server:

# Default Java options for heap and garbage collection.
GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"

# Avoid endless loop with some TLSv1.3 implementations.
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Djdk.tls.acknowledgeCloseNotify=true"

# Fix for log4j CVE-2021-44228
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Dlog4j2.formatMsgNoLookups=true"

which results e.g. in

gian@graylog1:~$ cat /proc/$(pidof java)/cmdline | tr '\0' '\n'
/usr/lib/jvm/temurin-17-jdk-amd64/bin/java
-Xms4g
-Xmx4g
-XX:NewRatio=1
-server
-XX:+ResizeTLAB
-XX:-OmitStackTraceInFastThrow
-Djdk.tls.acknowledgeCloseNotify=true
-Dlog4j2.formatMsgNoLookups=true
-jar
-Dlog4j.configurationFile=file:///etc/graylog/server/log4j2.xml
-Djava.library.path=/usr/share/graylog-server/lib/sigar
-Dgraylog2.installation_source=deb
/usr/share/graylog-server/graylog.jar
server
-f
/etc/graylog/server/server.conf
-np

Do you see any setting that needs tweaking? I do not need to change anything in Java’s keystore, since the service is on a private network and I am not terminating TLS directly on the graylog-server nodes.
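
If it helps, I could also enable GC logging on one node to rule out the garbage collector (the default collector changed between Java 8 and later releases). With the unified logging syntax of Java 9+, that would be something along these lines (log path and sizes are just my guess):

# appended to GRAYLOG_SERVER_JAVA_OPTS on one node only, for diagnostics
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Xlog:gc*:file=/var/log/graylog-server/gc.log:time,uptime:filecount=5,filesize=10m"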

JFYI, I am testing a rollback to Java 8 on one of the graylog-server nodes. So far my observations indicate that performance goes back to previous levels after the Java downgrade. I felt comfortable doing this because all official Graylog Docker images seem to run Java 8 by default.
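
For clarity, the rollback on the test node is just the per-service pin plus a restart (the Temurin 8 path is my assumption, mirroring the Temurin 17 path shown above):

# /etc/default/graylog-server on the test node
JAVA=/usr/lib/jvm/temurin-8-jdk-amd64/bin/java

# then
sudo systemctl restart graylog-server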

I do not want to jinx it, though; I would like to gather some more data, e.g. during tomorrow’s morning peak, to provide clearer evidence for all to see.

Hello,

I personally don’t see anything unusual. Looks standard to me.

More or less, I was looking into the Field type refresh interval and the Index rotation strategy. Both of these are normally time-based settings.

Sounds good, keep us informed. I personally would like to know for future troubleshooting.


Sorry, I completely missed that question. I am only working with the default index set, and its settings are:

  • Index prefix: graylog
  • Shards: 3
  • Replicas: 1
  • Field type refresh interval: 5 seconds
  • Index rotation strategy: Index Time
  • Rotation period: P1D (1 day, a day)
  • Index retention strategy: Delete
  • Max number of indices: 180

Regarding performance, here is the situation today

where the difference between the node running on Java 8 and the other two is clear as day. Given these results, I will roll back all nodes to Java 8.

At least from a practical standpoint, then, I can only observe that Graylog 4.2.12 performs much worse on Java 17 than on Java 8, even though both versions seem to run without errors.

I feel this is a very curious situation that would warrant more attention. If any of you can do more than I did to attract the attention of Graylog’s developers, I would greatly appreciate it. I am particularly puzzled that no one else reading these forums seems to have noticed this (is my installation that peculiar?), that there have been few comments on these threads, and that the installation guide pushes users towards Java 17. I would also like to understand why, if that is the suggested version, the Docker images still stick to Java 8 (though I see there are Java 11 variants), and whether that might be because someone has already observed some of this.

There is only so much I can do with my setups. An independent review would be really good, though I understand not many people have the time and interest to spend on it. I do not know if @H2Cyber’s suggestion still stands, but I opened the following issue on GitHub

