I have a Graylog cluster in the pre-production stage.
Java version: OpenJDK Runtime Environment (build 17.0.1+12-Ubuntu-120.04)
Elasticsearch version: 7.15.1
Graylog version: 4.2.5
MongoDB version: 3.6.9+really3.6.8+90~g8e540c0b6d-0ubuntu5.3
OS version: Ubuntu 20.04.4 LTS on a Hyper-V VM
In front of the cluster is an Nginx TCP load balancer.
This is a three-node cluster behind the load balancer. The log messages arrive on a TLS-protected port, and TLS termination is done on Graylog.
The traffic on the cluster is really small, under 10 log lines per second.
My problem is high CPU usage and high load. The CPU consumption of the Graylog process is constantly around 300% across the whole cluster. I tried adding more vCPU cores, but it got worse.
With 4 vCPUs in the VM I got nearly 300% load, but with 8 vCPUs I got around 700% load. So that is not a solution.
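For what it's worth, a quick way to confirm that the CPU really is going to the Graylog JVM rather than Elasticsearch or MongoDB (a sketch; the `graylog` pgrep pattern is an assumption and may need adjusting to your service name):

```shell
# Find the Graylog server JVM and show its CPU usage and thread count.
# pcpu: CPU percentage, nlwp: number of threads
GRAYLOG_PID=$(pgrep -f graylog | head -n 1)
if [ -n "$GRAYLOG_PID" ]; then
  ps -o pid,pcpu,nlwp,comm -p "$GRAYLOG_PID"
else
  echo "graylog process not found"
fi
```

On a 4 vCPU VM, a pcpu value near 300 matches the symptom described above.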
As you can see, the high CPU usage comes from Graylog.
Did you see anything in the Graylog, Elasticsearch, or MongoDB log files that may pertain to this issue?
I found these lines in the MongoDB log file:
2022-02-28T10:03:28.379+0100 I NETWORK [LogicalSessionCacheReap] Starting new replica set monitor for log-store/10.10.150.1:27017,10.10.150.2:27017,10.10.150.3:27017
2022-02-28T10:03:28.383+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for log-store/10.10.150.1:27017,10.10.150.2:27017,10.10.150.3:27017
2022-02-28T10:03:28.386+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for log-store/10.10.150.1:27017,10.10.150.2:27017,10.10.150.3:27017
2022-02-28T10:03:28.389+0100 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for log-store/10.10.150.1:27017,10.10.150.2:27017,10.10.150.3:27017
These few lines repeat periodically, every 5 minutes. It is weird, but I don't know how problematic it is. I believe this is not the problem.
The Elasticsearch and Graylog logs look fine. No errors, no warnings.
I am not using pipelines or extractors.
Do you think the solution is to downgrade Elasticsearch and/or Java?
I will try downgrading Elasticsearch in the next few days.
In that last link I posted above, the other community member had a very similar issue. It was the version of Java he was using, since Graylog runs on Java. You would need to test this out in your dev environment.
As for the MongoDB logs, I found this too, but I don't think it's what is creating the issue.
Downgrading Elasticsearch would not end well. I haven't found a way to downgrade Elasticsearch without losing data, but I think your issue may be with the Java version used. I'm not sure whether it's possible in your environment to test downgrading or installing a different version of Java.
NOTE: I have learned that when using Linux I make sure packages are pinned; this prevents issues later on.
I finally figured out my problem. I apologise, because I was being an idiot. The Java version doesn't matter, and the solution is mighty easy.
In the Graylog configuration file there are these few lines:
# Wait strategy describing how buffer processors wait on a cursor sequence. (default: sleeping)
# Possible types:
# - yielding
# Compromise between performance and CPU usage.
# - sleeping
# Compromise between performance and CPU usage. Latency spikes can occur after quiet periods.
# - blocking
# High throughput, low latency, higher CPU usage.
# - busy_spinning
# Avoids syscalls which could introduce latency jitter. Best when threads can be bound to specific CPU cores.
processor_wait_strategy = ???
inputbuffer_wait_strategy = ???
I tried all of them. blocking and yielding produce high load (5 - 8) and 350% - 400% CPU usage (on a 4 vCPU VM); busy_spinning produces 200% - 250% CPU usage and a high number of context switches; sleeping is the solution.
So after setting the wait strategy to sleeping, the CPU calmed down and stabilized around 60 - 80%. The system load is under 1 with 6 - 12 log lines per second.
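For reference, these are the two lines as I ended up setting them (in Graylog's server.conf; the exact path may differ per install, typically /etc/graylog/server/server.conf on Ubuntu packages):

```
processor_wait_strategy = sleeping
inputbuffer_wait_strategy = sleeping
```

sleeping is also the documented default, so simply removing any override of these keys should have the same effect.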
Gsmith, I am very grateful for your help.
Finally, the versions I use:
Java openjdk-17-jre 17.0.1+12-Ubuntu-120.04
Elasticsearch 7.15.1 (after testing I will upgrade to 8.0)
Oh nice, I totally overlooked those settings.
I do about 1000 msg/sec and have mine set as follows:
processor_wait_strategy = blocking
# Size of internal ring buffers. Raise this if raising outputbuffer_processors does not help anymore.
# For optimum performance your LogMessage objects in the ring buffer should fit in your CPU L3 cache.
# Must be a power of 2. (512, 1024, 2048, ...)
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
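As an aside, the power-of-two requirement mentioned in the ring_size comment is easy to verify; a small shell sketch (the is_pow2 helper is just for illustration, not part of Graylog):

```shell
# A positive integer n is a power of two iff n & (n - 1) == 0.
is_pow2() { n=$1; [ "$n" -gt 0 ] && [ $(( n & (n - 1) )) -eq 0 ]; }

is_pow2 65536 && echo "65536 ok"                  # 65536 = 2^16
is_pow2 1000 || echo "1000 not a power of 2"
```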
Thanks for keeping us updated; this is good to know.