Hi,
Regarding the ingestion: the system generally handles ~50 GB per day.
As suggested, I took a peek at the processes running on one of the nodes this morning. I ran
top -b -d 60 | grep -A 10 -F PID
over the hottest hour, and the outputs all looked like this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
26247 graylog 20 0 8268596 4,4g 84420 S 275,4 57,1 3214:56 java
512 mongodb 20 0 2203624 323536 41216 S 4,3 4,0 8546:17 mongod
4246 root 20 0 0 0 0 I 0,7 0,0 0:02.78 kworker/2:2-events
4173 node-exp 20 0 719380 22296 12088 S 0,1 0,3 296:52.89 node_exporter
4462 root 20 0 0 0 0 I 0,1 0,0 0:00.09 kworker/3:2-events_freezable_power_
28744 haproxy 20 0 15700 2740 1152 S 0,1 0,0 43:33.34 haproxy
10 root 20 0 0 0 0 I 0,0 0,0 67:27.42 rcu_sched
248 root 20 0 0 0 0 S 0,0 0,0 69:31.40 jbd2/dm-0-8
515 Debian-+ 20 0 42900 14500 9600 S 0,0 0,2 72:53.43 snmpd
1 root 20 0 170776 10636 7904 S 0,0 0,1 70:21.02 systemd
which kind of confirms that Graylog is the biggest CPU consumer. I also reviewed buffer and journal utilization using the built-in Prometheus exporter, and I can confirm that the buffers are always empty and the journal never fills beyond 2%.
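If it helps, I can also grab a per-thread view of the java process and a raw dump of those metrics; something along these lines (the PID is the one from the top output above, and I am assuming the exporter's default 127.0.0.1:9833 bind address and the usual /metrics path):
# per-thread CPU of the Graylog JVM, single batch snapshot
top -H -b -n 1 -p 26247 | head -n 40
# buffer/journal metrics from the built-in Prometheus exporter
curl -s http://127.0.0.1:9833/metrics | grep -iE 'journal|buffer'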
In the afternoon, I tried a full rolling shutdown and cold restart of all VMs, but this did not change any of the performance metrics.
Finally, I tried restarting one of the graylog-server instances without -Dlog4j2.formatMsgNoLookups=true
and left it running for some time, but the performance of that node exactly matched that of the nodes still running with the option enabled.
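In case the exact check matters: to confirm whether the flag is actually active on a running node, one can look at the JVM's command line, for example (using the PID from the top output above):
tr '\0' '\n' < /proc/26247/cmdline | grep formatMsgNoLookups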
I took the time to make sure the JSON parsing warning from one of my pipelines was resolved, and since then I see nothing interesting in the logs except for index rotation. Upon restarting, I mostly get informational messages, apart from these warnings:
2022-01-14T16:17:16.460+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=Syslog UDP, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0xdd42cb3b, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
2022-01-14T16:17:16.460+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=Syslog UDP, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0xb8d7f8de, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
2022-01-14T16:17:16.462+01:00 WARN [AbstractTcpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogTCPInput{title=Syslog TCP, type=org.graylog2.inputs.syslog.tcp.SyslogTCPInput, nodeId=null} (channel [id: 0x92f75b38, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
2022-01-14T16:17:16.470+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=Syslog UDP, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0x5a74b7d2, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
2022-01-14T16:17:16.476+01:00 WARN [UdpTransport] receiveBufferSize (SO_RCVBUF) for input SyslogUDPInput{title=Syslog UDP, type=org.graylog2.inputs.syslog.udp.SyslogUDPInput, nodeId=null} (channel [id: 0xf49687f1, L:/0:0:0:0:0:0:0:0%0:1514]) should be 4194304 but is 8388608.
which do not worry me (I set 4194304 as the receive buffer size on the inputs, since the default one was low and caused me headaches by dropping messages, and I do not think that having them larger is a problem!), and the following JVM output in systemd's journal:
gen 14 16:17:02 graylog1 systemd[1]: Started Graylog server.
gen 14 16:17:03 graylog1 graylog-server[501]: OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
gen 14 16:17:04 graylog1 graylog-server[501]: WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: An illegal reflective access operation has occurred
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: Illegal reflective access by retrofit2.Platform (file:/usr/share/graylog-server/graylog.jar) to constructor java.lang.invoke.MethodHandles$Lookup(java.lang.Class,int)
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: Please consider reporting this to the maintainers of retrofit2.Platform
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
gen 14 16:17:11 graylog1 graylog-server[501]: WARNING: All illegal access operations will be denied in a future release
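Related to the UseConcMarkSweepGC warning above: if it is useful, I can also dump the flags the running JVM actually picked up, e.g. (PID from the top output above, run as the graylog user, assuming the JDK tools are installed on the node):
sudo -u graylog jcmd 26247 VM.flags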
Regarding plugins, I did not install any. I guess, then, that I should not be worried about the version mismatch you suggested checking for?
I compared my configuration with yours and I do not see many differences. You have some performance parameters (from the little I understand) that have been raised to better suit the fact that you have more cores, and you do bigger output batches to Elasticsearch, but I do not see any other relevant difference.
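For reference, the knobs I am talking about are, I believe, the usual processing/output ones in server.conf; the values below are just the stock defaults as far as I know, not necessarily what either of us actually runs:
# server.conf excerpt (stock defaults, only indicative)
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2
output_batch_size = 500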
I admit I have very few further steps in mind except maybe:
- Upgrading all pending OS packages (I am behind a few bugfix releases for both OpenJDK and MongoDB); a quick way to see what is pending is sketched after this list. I have low hopes for this, but at least it should not make things worse, and I will need to apply the bugfix updates someday anyway.
- Trying to upgrade to Debian bullseye, Graylog 4.2 and OpenJDK 17. This will take me quite some time, however, since it is a big leap forward.
- Trying to go back to OpenJDK 8. I never had a problem with OpenJDK 11 on this installation, and the docs seem to suggest it has been compatible since Graylog 3.x, but they also still list OpenJDK 8 as the official requirement, so maybe my luck with OpenJDK 11 has run out!
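(The quick check I mean for the pending packages, nothing fancy, just to see the exact versions involved:)
java -version
apt list --upgradable 2>/dev/null | grep -Ei 'openjdk|mongodb'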
One other thing I am wondering is whether I should take a closer look at the JVM performance parameters. I would like, for example, to understand whether the heap allocated to Graylog is doing fine, or whether the load may come from stress somewhere like the GC. This, however, is also rather outside my expertise; I am only parroting the most commonly heard horror stories without knowing whether they even apply to Graylog!
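If it would help, I can collect some heap/GC numbers with the stock JDK tools; something along these lines (PID from the top output above, run as the graylog user, assuming jstat is available on the node):
# heap occupancy and GC counts/times, sampled every 10 seconds
sudo -u graylog jstat -gcutil 26247 10s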