Hi community
Please shed some light wherever you can. I am happy to provide more information.
We recently upgraded our on-premises Graylog cluster to v4.0.0 and it is now very slow.
The cluster consists of 2x Graylog nodes and 3x Elasticsearch nodes, with a MongoDB replica set running on the same hosts as Elasticsearch. We load-balance the Graylog nodes using NGINX. All of these nodes are in the same network zone, with no firewalls between them.
We have tried bypassing NGINX and accessing the nodes directly, but performance is still the same. Elasticsearch HTTP response times are good. The load/performance on all the nodes looks good. There is nothing out of the ordinary in the log files for mongodb, elasticsearch and graylog-server.
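For anyone who wants to compare response times the same way, here is a minimal sketch of how such a check could be scripted. It is not the exact check we ran. The helper names (`time_request`, `summarize`, `probe`) and the example host are hypothetical; only the Python standard library is used.

```python
import statistics
import time
import urllib.request


def time_request(url, timeout=10.0):
    """Return the wall-clock seconds taken to fetch `url` once."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()  # include body transfer in the measurement
    return time.perf_counter() - start


def summarize(durations):
    """Return min, median and max of a list of durations in seconds."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


def probe(url, repeats=10, fetch=time_request):
    """Time `repeats` sequential requests to `url` and summarize them.

    `fetch` is injectable so the logic can be exercised without a network.
    """
    return summarize([fetch(url) for _ in range(repeats)])
```

For example, `probe("http://es-node-1:9200/_cluster/health")` (hypothetical host name) would time ten requests against the Elasticsearch cluster health endpoint. Running the same probe against each Graylog node directly and then through NGINX should show whether the delay sits in front of or behind the Graylog API. Note that the Graylog nodes here use TLS, so certificate verification may need extra handling.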
We have a similar setup in production which is running on Graylog 3.0.0 and is performing as expected.
Hosts + Setup Info
All hosts run Ubuntu 16.04 (Linux)
2x Graylog nodes:
- 16 GB memory
- 16 cores
- Each node runs Graylog v4.0.0 with heap config: -Xms3g -Xmx10g
3x Elasticsearch + MongoDB nodes:
- 16 GB memory
- 8 cores
- Each node runs MongoDB v4.0.21 + Elasticsearch v6.8.13 with heap: -Xms8g -Xmx8g
Configurations are shown in Ansible template form for security purposes:
server.conf
############################
# GRAYLOG CONFIGURATION FILE
###########################
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = {{ graylog_ui_password }}
root_username = {{ graylog_ui_username }}
root_password_sha2 = {{ graylog_ui_password }}
root_timezone = Africa/Johannesburg
plugin_dir = {{ plugin_dir }}
###############
# HTTP settings
###############
http_bind_address = 0.0.0.0:{{ graylog_listen_port }}
http_publish_uri = https://{{ inventory_hostname }}:{{ graylog_listen_port }}/
http_external_uri = https://{{ loadbalancer_url }}/
http_enable_tls = true
http_tls_cert_file = {{ cert_file }}
http_tls_key_file = {{ key_file }}
elasticsearch_hosts = http://{{ elasticsearch_mongo_hosts[0] }}:{{ elasticsearch_listen_port }},\
http://{{ elasticsearch_mongo_hosts[1] }}:{{ elasticsearch_listen_port }},\
http://{{ elasticsearch_mongo_hosts[2] }}:{{ elasticsearch_listen_port }}
elasticsearch_connect_timeout = 20s
elasticsearch_max_total_connections = 40
elasticsearch_max_total_connections_per_route = 4
rotation_strategy = time
elasticsearch_max_time_per_index = {{ es_index_rotation }}
elasticsearch_max_number_of_indices = 180
retention_strategy = delete
elasticsearch_shards = 3
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
elasticsearch_analyzer = standard
elasticsearch_request_timeout = 2m
elasticsearch_index_optimization_jobs = 50
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_age = 12h
message_journal_max_size = 5gb
lb_recognition_period_seconds = 3
stream_processing_timeout = 5000
stream_processing_max_faults = 5
mongodb_uri = mongodb://{{ mongodb_user }}:{{ mongodb_pass }}@{{ elasticsearch_mongo_hosts[0] }}:{{ mongodb_listen_port }},\
{{ elasticsearch_mongo_hosts[1] }}:{{ mongodb_listen_port }},\
{{ elasticsearch_mongo_hosts[2] }}:{{ mongodb_listen_port }}/{{ mongodb }}
mongodb_max_connections = 100
mongodb_threads_allowed_to_block_multiplier = 5
proxied_requests_thread_pool_size = 32
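A side note on the server.conf above: Graylog expects `root_password_sha2` to be the SHA-256 hex digest of the root password, while `password_secret` is a separate secret string. Both lines are filled from the same template variable here, and the template does not show whether that variable holds a plaintext value or a hash. A minimal sketch of producing the digest, with a hypothetical helper name:

```python
import hashlib


def root_password_sha2(plaintext):
    """Return the SHA-256 hex digest of `plaintext`, encoded as UTF-8.

    This is the form Graylog expects in `root_password_sha2`.
    """
    return hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
```

The result is always a 64-character hexadecimal string, so a value of any other shape in the rendered server.conf would be worth a second look.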
elasticsearch.yml
# ======================== Elasticsearch Configuration =========================
cluster.name: {{ cluster }}
node.name: ${HOSTNAME}
path.data: {{ es_data_dir }}
path.logs: {{ es_log_dir }}
bootstrap.memory_lock: {{ memory_lock }}
network.host: ${HOSTNAME}
http.port: {{ elasticsearch_listen_port }}
discovery.zen.ping.unicast.hosts: ["{{ elasticsearch_mongo_hosts[0] }}", "{{ elasticsearch_mongo_hosts[1] }}", "{{ elasticsearch_mongo_hosts[2] }}"]
discovery.zen.minimum_master_nodes: 2
gateway.recover_after_nodes: 2
mongo.conf
storage:
  dbPath: {{ mongo_data_dir }}
  journal:
    commitIntervalMs: 120
  directoryPerDB: true
  syncPeriodSecs: 80
systemLog:
  destination: file
  logAppend: true
  verbosity: 1
  traceAllExceptions: true
  logRotate: rename
  timeStampFormat: ctime
  path: /var/log/mongodb/mongod.log
net:
  port: {{ mongodb_listen_port }}
  bindIp: {{ inventory_hostname }}, 127.0.0.1
  bindIpAll: false
  maxIncomingConnections: 51200
  wireObjectCheck: false
  ipv6: false
  unixDomainSocket:
    enabled: true
    pathPrefix: /tmp
    filePermissions: 0700
  ssl:
    mode: allowSSL
    PEMKeyFile: {{ PEMKeyFile }}
    clusterFile: {{ PEMKeyFile }}
    CAFile: {{ CAFile }}
    allowConnectionsWithoutCertificates: false
    allowInvalidCertificates: false
    allowInvalidHostnames: false
  compression:
    compressors: zlib,snappy
  transportLayer: asio
  serviceExecutor: synchronous
processManagement:
  timeZoneInfo: /usr/share/zoneinfo
  pidFilePath: /var/log/mongodb/mongo.pid
  fork: false
security:
  keyFile: {{ keyFile }}
  clusterAuthMode: keyFile
  authorization: enabled
  transitionToAuth: false
  javascriptEnabled: true
operationProfiling:
  mode: slowOp
  slowOpThresholdMs: 10000
  slowOpSampleRate: 1.0
replication:
  replSetName: graylog-dev
  secondaryIndexPrefetch: all
Nodes overview:
Sample profiling:
A thread dump on the nodes shows a couple of locks and several threads in WAITING state. I hit the post body limit, so I cannot include the dump here.
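Since the full dump does not fit, a per-state summary can be posted instead. This is a minimal sketch, assuming the dump is in the standard HotSpot text format (as produced by `jstack`), where each thread has a `java.lang.Thread.State: ...` line. The function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as "   java.lang.Thread.State: WAITING (parking)".
STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")


def count_thread_states(dump_text):
    """Count how many threads in a thread dump are in each state."""
    return Counter(STATE_RE.findall(dump_text))
```

Calling `count_thread_states(open("dump.txt").read())` on a saved dump (the file name is an assumption) returns a mapping from state names such as WAITING or BLOCKED to thread counts. Many WAITING threads are normal for idle pool workers, so the BLOCKED count and the lock owners are usually the more telling part.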