High CPU load on a three-node cluster

Hi folks,

I have a Graylog cluster that is just before the production stage.

java version: OpenJDK Runtime Environment (build 17.0.1+12-Ubuntu-120.04)
Elasticsearch version: 7.15.1
graylog version: 4.2.5
mongodb version: 3.6.9+really3.6.8+90~g8e540c0b6d-0ubuntu5.3
OS version: Ubuntu 20.04.4 LTS on a Hyper-V VM

In front of the cluster is an Nginx TCP balancer

This is a three-node cluster behind a load balancer. The log messages arrive on a TLS-protected port, and TLS termination happens on the Graylog nodes.
The traffic on the cluster is really small, under 10 log lines per second.
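
For reference, the balancer is a plain TCP passthrough using Nginx's stream module, roughly like the sketch below (the second and third node IPs, the port 6514, and the upstream name are illustrative examples, not my exact config):

stream {
    upstream graylog_tls_input {
        server 10.10.120.31:6514;     # node 1
        server 10.10.120.32:6514;     # node 2 (example IP)
        server 10.10.120.33:6514;     # node 3 (example IP)
    }
    server {
        listen 6514;                  # clients connect here over TLS
        proxy_pass graylog_tls_input; # TLS passes through and is terminated on Graylog
    }
}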

My problem is high CPU usage and high load. The CPU consumption of the Graylog process is constantly around 300% on the whole cluster. I tried to add more vCPU cores, but it only got worse.

If the VM had 4 vCPUs I got nearly 300% load, but if the VM had 8 vCPUs I got around 700% load, so that is not a solution.

I tried to change the *_processors settings, but no luck.

Graylog config:

is_master = true # true on one node, false on the other two
node_id_file = /opt/gl/conf/node-id
password_secret = dylEtb5...MtSHMtAD
root_username = root
root_password_sha2 = a6dd408....fcc84851d
root_email = "<email address>"
root_timezone = CET
bin_dir = bin
data_dir = data
plugin_dir = plugin
http_bind_address = 10.10.120.31:9000
http_publish_uri = https://10.10.120.31:9000
http_external_uri = https://logstore.XXXXXX.local/graylog/
http_enable_gzip = false
http_enable_tls = true
http_tls_cert_file = /opt/gl/graylog/conf/log-idx-01.mgmt.XXXXXX.local.cert.pem
http_tls_key_file = /opt/gl/graylog/conf/log-idx-01.mgmt.XXXXXX.local.key.pem
trusted_proxies = 10.10.120.18/32, 10.10.120.17/32
elasticsearch_hosts = https://<user>:<password>@log-idx-01.mgmt.XXXXXX.local:9200,https://<user>:<password>@log-idx-02.mgmt.XXXXXX.local:9200,https://<user>:<password>@log-idx-03.mgmt.XXXXXX.local:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 2
outputbuffer_processors = 2
processor_wait_strategy = yielding
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = yielding
message_journal_enabled = true
message_journal_dir = data/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://<otheruser>:<otherpassword>@10.10.150.1:27017,10.10.150.2:27017,10.10.150.3:27017/graylog?replicaSet=log-store&authSource=admin
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
proxied_requests_thread_pool_size = 32

Any ideas are welcome.
Thanks in advance

Hello && Welcome @molnar_istvan

I might be able to help.

Out of curiosity, what exact process do you see when you run htop or top that may pertain to the high CPU issue?

Did you see anything in the Graylog, Elasticsearch, or MongoDB log files that may pertain to this issue?
By chance, do you have extractors or pipelines configured on any inputs?

EDIT: I just noticed you are using Elasticsearch 7.15; you may run into problems, since as far as I know Graylog 4.x only officially supports Elasticsearch up to 7.10.

I’m not 100% sure, but this community member had an unusual issue with Java.

Hi gsmith,

What exact process do you see when you run htop or top?

top:

1448 gl        20   0 4949152   1.2g  33116 S 397.7  10.1  16110:39 java

ps aux:

gl          1448  396 10.0 4949152 1226916 ?     Sl   Feb25 16142:50 /usr/bin/java -Dlog4j2.formatMsgNoLookups=true -Djdk.tls.acknowledgeCloseNotify=true -Xms1g -Xmx1g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow -jar graylog.jar server -f /opt/gl/graylog/conf/graylog.conf -p /opt/gl/graylog/graylog.pid

As you can see, the high CPU usage comes from Graylog.
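
In case it is useful, this is roughly how the busy threads inside the JVM can be identified (a sketch; PID 1448 is taken from the output above, the TID below is made up, and jstack requires the JDK rather than just the JRE):

# show per-thread CPU usage inside the Graylog JVM
top -H -p 1448

# dump the Java thread stacks and look up a busy thread by its ID
jstack 1448 > /tmp/graylog-threads.txt
printf '%x\n' 1501   # convert a busy TID from top to hex; jstack prints thread IDs as nid=0x... in hex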

Did you see anything in the Graylog, Elasticsearch, or MongoDB log files that may pertain to this issue?

I found these lines in the MongoDB log file:

2022-02-28T10:03:28.379+0100 I NETWORK  [LogicalSessionCacheReap] Starting new replica set monitor for log-store/10.10.150.1:27017,10.10.150.2:27017,10.10.150.3:27017
2022-02-28T10:03:28.383+0100 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for log-store/10.10.150.1:27017,10.10.150.2:27017,10.10.150.3:27017
2022-02-28T10:03:28.386+0100 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for log-store/10.10.150.1:27017,10.10.150.2:27017,10.10.150.3:27017
2022-02-28T10:03:28.389+0100 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for log-store/10.10.150.1:27017,10.10.150.2:27017,10.10.150.3:27017

These few lines repeat periodically every 5 minutes. It is weird, but I don’t know how problematic it is. I believe this is not the problem.

The Elasticsearch and Graylog logs look fine: no errors, no warnings.

I am not using pipelines or extractors.

Do you think the solution is to downgrade Elasticsearch and/or Java?

I will try to downgrade Elasticsearch in the next few days.

Hello,

In that last link I posted above, the other community member had a very similar issue. It was the version of Java he was using, since Graylog runs on Java. You would need to test this out in your dev environment.

As for the MongoDB logs, I found this, but I don’t think it’s what is creating the issue.

As for…

Downgrading Elasticsearch would not end well; I haven’t found a way to downgrade Elasticsearch without losing data. I think your issue may be the Java version used. I’m not sure whether your environment makes it possible to test downgrading or installing a different version of Java.

NOTE: I have learned, when using Linux, to make sure packages are pinned; this will prevent issues later on.
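
For example, on Ubuntu you can hold the relevant packages at their current versions (a sketch; the package names are assumptions, adjust them to whatever you actually installed):

# prevent apt from upgrading these packages
sudo apt-mark hold openjdk-17-jre elasticsearch graylog-server

# list the packages that are currently held
apt-mark showhold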

Hi gsmith,

My problem is finally figured out. I apologise, because I was an idiot. The Java version doesn’t matter, and the solution was mighty easy.

In the Graylog configuration file there are these few lines:

# Wait strategy describing how buffer processors wait on a cursor sequence. (default: sleeping)
# Possible types:
#  - yielding
#     Compromise between performance and CPU usage.
#  - sleeping
#     Compromise between performance and CPU usage. Latency spikes can occur after quiet periods.
#  - blocking
#     High throughput, low latency, higher CPU usage.
#  - busy_spinning
#     Avoids syscalls which could introduce latency jitter. Best when threads can be bound to specific CPU cores.
processor_wait_strategy = ???
...
inputbuffer_wait_strategy = ???

I tried all of them (on a 4 vCPU VM):

  • blocking and yielding: high load (5 - 8) and 350% - 400% CPU usage
  • busy_spinning: 200% - 250% CPU usage and a high number of context switches
  • sleeping: the solution

So after setting the wait strategy to sleeping, the CPU calmed down and stabilized around 60 - 80%. The system load is under 1 with 6 - 12 log lines per second.
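
For completeness, the relevant lines in my graylog.conf are now:

processor_wait_strategy = sleeping
inputbuffer_wait_strategy = sleeping

As far as I understand, the yielding and busy_spinning strategies keep the buffer processor threads spinning even when the ring buffers are empty, so on an almost idle cluster they burn CPU for nothing, while sleeping parks the threads between messages.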

gsmith, I am very grateful for your help.

Finally, the versions I use:

  • Java openjdk-17-jre 17.0.1+12-Ubuntu-120.04
  • GrayLog 4.2.5
  • Elasticsearch 7.15.1 (after testing i will upgrade to 8.0)
  • Mongodb 3.6.8
  • Ubuntu 20.04.4
  • Linux 5.11.0-1028-azure

Oh nice, I totally overlooked those settings.
I do about 1000 msg/s and have mine set as follows:

processor_wait_strategy = blocking

# Size of internal ring buffers. Raise this if raising outputbuffer_processors does not help anymore.
# For optimum performance your LogMessage objects in the ring buffer should fit in your CPU L3 cache.
# Must be a power of 2. (512, 1024, 2048, ...)
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

No problem,
Thanks for keeping us updated; this is good to know.

To give some other feedback:

Same settings as gsmith, and I have an average of 7000 msg/s with 4 Graylog nodes (with much higher spikes).
