High CPU on GrayLog Nodes


(Khalique Zafar) #1

We have a 3-node Graylog cluster running graylog-server and MongoDB on all three nodes (VMs), plus one nginx load balancer (VM) that balances both log traffic and web UI access. Elasticsearch is installed on a separate 4-node Elastic cluster.

We are seeing high CPU usage from the java (graylog) process, sometimes more than 500%, as you can see below in this extract of top output from one node.

top - 08:00:21 up 20:24, 2 users, load average: 6.53, 5.83, 6.65
Tasks: 221 total, 1 running, 220 sleeping, 0 stopped, 0 zombie
%Cpu(s): 59.2 us, 0.2 sy, 0.0 ni, 30.0 id, 10.4 wa, 0.0 hi, 0.1 si, 0.0 st
KiB Mem : 32773516 total, 570936 free, 8929288 used, 23273292 buff/cache
KiB Swap: 2097148 total, 2097148 free, 0 used. 23390284 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
20457 root 20 0 13.844g 7.981g 21704 S 508.2 25.5 5585:04 java

Below is the configuration of graylog infrastructure:

–3 x node VMs running on VMware vSphere ESXi 6.x
Hardware Specification of each node
4 vCPUs 2.3Ghz
32GB RAM
Flash disk: 500GB

Software Specification:
OS: Red Hat Enterprise Linux Server release 7.4 (Maipo)
GrayLog Server: Graylog 2.4.3+2c41897
Mongodb: db version v3.6.2
git version: 489d177dbd0f0420a8ca04d39fd78d0a2c539420
OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
Java: java version "1.8.0_161", Java™ SE Runtime Environment (build 1.8.0_161-b12), Java HotSpot™ 64-Bit Server VM (build 25.161-b12, mixed mode)

–Elasticsearch 5.7.x is running on a separate 4-node cluster

Below is the /etc/graylog/server/server.conf for the master node; the configuration for the secondary/slave nodes is the same except is_master = false.

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = SbDz5kYSqpZ2jj18Nqvn80mflIkIbPMSggsNAK4UgDeNF73k8AgVnXrKBUWrFAxxiwfFf360cMegeqNEupFTtfs61PCux460
root_password_sha2 = 91aa480056871283357058827b45a528942cf2ada69b312575fa1898d9589f6c
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://log01.kz.local:9000/api/
rest_transport_uri = http://log01.kz.local:9000/api/
rest_enable_cors = false
rest_tls_cert_file = /etc/graylog/server/certificates/graylog.crt
rest_tls_key_file = /etc/graylog/server/certificates/graylog.key
trusted_proxies = 127.0.0.1/32, 0:0:0:0:0:0:0:1/128,10.237.95.0/32
web_listen_uri = http://log01.kz.local:9000/
web_enable_cors = false
elasticsearch_hosts = https://elastic:HVR8exrqVa1Qqkq6OAa5ykNP@4df4b80500ff4e1eab3b7e2e4e783564.kz.local
elasticsearch_connect_timeout = 30s
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 2000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 10
outputbuffer_processors = 10
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /app1/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://log01:27017,log02:27017,log03:27017/graylog?replicaSet=rs-db01
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_web_interface_url = https://logexplorer.kz.local
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

JVM setting for graylog in /etc/sysconfig/graylog-server

GRAYLOG_SERVER_JAVA_OPTS="-Xms8g -Xmx8g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"


(Tess) #2

First off, two things not related to your problem:

  1. When pasting large blocks of code, like your config file, please use code-blocks to keep things legible.
  2. You just pasted your whole production configuration, including your admin password, your password crypt secret string and a bunch of details that provide information about your network. That’s not a good idea.

(Ben van Staveren) #3

You may want to tweak your processor thread counts in the Graylog server configuration file (processbuffer_processors, outputbuffer_processors, inputbuffer_processors). The sum of all three should not exceed the number of available cores in your setup. I’m not sure whether vSphere exposes 4 vCPUs as having 8 cores, but your configuration already goes up to 22 processors total.

Ideally you’ll want to set processbuffer_processors to 4, outputbuffer_processors to 2, and inputbuffer_processors to 2 on your current setup. Or scale up to 12 vCPUs, in which case you can keep your configuration as-is. This also, of course, depends on how many messages/sec you’re putting into Graylog, and how heavy your pipelines/extractors/etc. are.
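For the 4-vCPU setup described above, the suggested split would look like this in server.conf (a sketch of the advice in this post, assuming hyperthreading exposes 8 logical cores):

```
processbuffer_processors = 4
outputbuffer_processors = 2
inputbuffer_processors = 2
```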


(Khalique Zafar) #4

Thanks for pointing that out. Actually, it’s not my production setup; I reproduced the issue in a staging environment and shared that configuration. Noted, with thanks. I’ll make sure to use code blocks in the future.


#5

Also, you did not provide info on messages/second ingested, or whether you use regexes in extractors or pipelines. Graylog can consume a lot of CPU cycles if you configure inefficient regexes.
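To illustrate how much CPU an inefficient regex can burn (this is a generic Python demonstration, not Graylog-specific code), compare a classic catastrophic-backtracking pattern against a linear one on input that fails to match:

```python
import re
import timeit

# The nested quantifier (a+)+ gives the engine exponentially many ways
# to split the input; on a failing match it tries all of them.
slow = re.compile(r'(a+)+b')
# Equivalent intent, without the ambiguous nesting.
fast = re.compile(r'a+b')

text = 'a' * 18 + 'c'   # the trailing 'c' guarantees the match fails

t_slow = timeit.timeit(lambda: slow.search(text), number=1)
t_fast = timeit.timeit(lambda: fast.search(text), number=1)

print(t_slow > t_fast)  # the backtracking pattern takes far longer
```

Add a few more 'a' characters to the input and the slow pattern quickly becomes pathological; an extractor doing this on every incoming message can easily pin a core.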


(Ben van Staveren) #6

Just as a hint on regular expressions and even Grok patterns: always try to anchor them to the beginning of the string with ^; even if you need something from the middle, it’s often faster that way. If you already know in advance that you will run many regexes against the same field, either make sure the logging application supplies it in an easier-to-handle format, or use the split function to break the message into separate parts, run your regexes on only the parts you need, store those results, and then get rid of the separate parts if you no longer need them.
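In a Graylog pipeline rule this would use the split and regex functions; as a plain-Python sketch of the same idea (the log format and field names here are made up for illustration):

```python
import re

# Hypothetical log line with key=value parts separated by '|'.
message = 'ts=2018-04-02T08:00:21|level=ERROR|src=10.0.0.5|msg=timeout'

# Split once, then run small anchored regexes on just the parts we need,
# instead of scanning the whole message repeatedly.
parts = message.split('|')

level_re = re.compile(r'^level=(\w+)')   # anchored with ^
src_re = re.compile(r'^src=([\d.]+)')

fields = {}
for part in parts:
    m = level_re.match(part)
    if m:
        fields['level'] = m.group(1)
    m = src_re.match(part)
    if m:
        fields['src_ip'] = m.group(1)

print(fields)  # {'level': 'ERROR', 'src_ip': '10.0.0.5'}
```

Each anchored regex either matches at the start of its short part or fails immediately, so there is no repeated scanning of the full message.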


#7

You can monitor the processing times under System / Nodes → Metrics.
Check the running time of pipelines, streams, and extractors.
With https://regex101.com/ you can also check your regex’s running time and optimize it.