Hey everyone. I was hoping to get some advice regarding lackluster performance with my current setup. I have set up a Graylog cluster with two Graylog nodes running in Docker (CE) swarm mode and two ES nodes running on RHEL 7 (a third ES node and a third Swarm node are on their way; provisioning just got delayed).
To make this post easier to read, I’ll put the details at the bottom. Basically, under light load (0-100 messages/sec) everything runs great. Queries are fast, input buffers are at zero, and the Kafka journal stays below 500 messages. Under heavier load (200-500 messages/sec), things start to go awry. I have seen the output buffer go to 100%, followed by a restart of the Graylog server, and the disk journal filled to 200,000+ messages before I shut off the input. I am really just trying to get a grip on how to properly diagnose what’s going on. Any help would be greatly appreciated.
From what I have observed, our system occasionally sends large log messages (around 10 MB max). I currently believe this is the culprit and that it is putting heavy back pressure on the output buffer. I have code going out soon to limit the size of these logs. To be 100% honest, I’m not really sure what the desired/nominal ranges are for many of these metrics. A great addition to the docs might be expected nominal metric values for common loads. Any input on this would also be very much appreciated.
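For anyone wondering what the size limiting might look like on the sending side, here is a minimal sketch. It is not the actual code going out; the function names, the sample field values and the 32,000-byte budget are all hypothetical. The budget is chosen to stay under Lucene's 32766-byte limit for a single indexed term, which is what the "32kb" error in the mapping section refers to.

```python
# Hypothetical per-field budget, kept below Lucene's 32766-byte term limit.
MAX_FIELD_BYTES = 32_000


def truncate_utf8(value: str, max_bytes: int = MAX_FIELD_BYTES) -> str:
    """Cut value so its UTF-8 encoding fits in max_bytes without splitting a character."""
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # errors="ignore" drops a partial multi-byte sequence left at the cut point.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def limit_fields(record: dict, max_bytes: int = MAX_FIELD_BYTES) -> dict:
    """Return a copy of record with every string field truncated; other types pass through."""
    return {
        key: truncate_utf8(val, max_bytes) if isinstance(val, str) else val
        for key, val in record.items()
    }


if __name__ == "__main__":
    # Sample record with made-up values; requestContent is the oversized field.
    sample = {"short_message": "request failed", "requestContent": "x" * 10_000_000}
    trimmed = limit_fields(sample)
    print(len(trimmed["requestContent"].encode("utf-8")))
```

Truncating by encoded bytes rather than by characters matters here because the limits in question are expressed in bytes.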
Input Config
1 - Global GELF UDP Input on 12201 (1 on master, 1 on slave)
Messages are sent as JSON over GELF UDP. A JSON extractor parses each message and creates all fields.
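One thing that may be worth checking with messages this large: the GELF specification caps a chunked UDP message at 128 chunks, each carrying a 12-byte chunk header. The sketch below estimates how many chunks a payload needs. The 8192-byte chunk size is an assumption (it is a sender-side setting and varies by client library), and the payload length is the size after any compression the client applies.

```python
GELF_MAX_CHUNKS = 128          # limit from the GELF specification
GELF_CHUNK_HEADER_BYTES = 12   # 2 magic bytes + 8-byte message id + sequence number + sequence count
CHUNK_SIZE_BYTES = 8192        # assumption: sender-side setting, differs between clients


def chunks_needed(payload_len: int, chunk_size: int = CHUNK_SIZE_BYTES) -> int:
    """Number of UDP datagrams needed for a payload of payload_len bytes."""
    if payload_len <= chunk_size:
        return 1  # small enough to go out as a single, unchunked datagram
    data_per_chunk = chunk_size - GELF_CHUNK_HEADER_BYTES
    # Integer ceiling division, avoids float rounding.
    return -(-payload_len // data_per_chunk)


def fits_in_gelf_udp(payload_len: int, chunk_size: int = CHUNK_SIZE_BYTES) -> bool:
    """True if the payload can be sent within the 128-chunk limit."""
    return chunks_needed(payload_len, chunk_size) <= GELF_MAX_CHUNKS


if __name__ == "__main__":
    ten_mb = 10 * 1024 * 1024
    print(chunks_needed(ten_mb), fits_in_gelf_udp(ten_mb))
```

Under these assumptions the ceiling is roughly 1 MB per message on the wire, so whether a 10 MB log makes it through depends on how well it compresses.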
Metrics of Concern
org.graylog2.outputs.BlockingBatchedESOutput.batchSize
Histogram
95th percentile: 19
98th percentile: 23
99th percentile: 26
Standard deviation: 5
Mean: 4
Minimum: 1
Maximum: 50
Count: 12,971
org.graylog2.outputs.BlockingBatchedESOutput.bufferFlushes
Meter
Total: 12,971 events
Mean: 0.31 events/second
1 minute avg: 0.41 events/second
5 minute avg: 0.4 events/second
15 minute avg: 0.41 events/second
org.graylog2.outputs.BlockingBatchedESOutput.bufferFlushesRequested
Meter
Total: 14,886 events
Mean: 0.36 events/second
1 minute avg: 0.7 events/second
5 minute avg: 0.7 events/second
15 minute avg: 0.69 events/second
org.graylog2.outputs.BlockingBatchedESOutput.bufferFlushFailures
Meter
Total: 0 events
Mean: 0 events/second
1 minute avg: 0 events/second
5 minute avg: 0 events/second
15 minute avg: 0 events/second
org.graylog2.outputs.BlockingBatchedESOutput.processTime
Timer
95th percentile: 264,434μs
98th percentile: 463,344μs
99th percentile: 463,344μs
Standard deviation: 97,894μs
Mean: 64,884μs
Minimum: 5,827μs
Maximum: 1,555,025μs
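To put these numbers side by side, here is a rough back-of-the-envelope sketch. The function name is made up, the inputs are copied from the metrics above and from output_batch_size in the config further down, and the reported mean batch size is already rounded, so the results are approximations for this one node only.

```python
def summarize_output_metrics(flush_rate_per_s: float,
                             mean_batch_size: float,
                             configured_batch_size: int,
                             mean_process_time_us: float) -> dict:
    """Derive a few comparable figures from the BlockingBatchedESOutput metrics."""
    return {
        # flushes per second times messages per flush
        "approx_messages_per_s": flush_rate_per_s * mean_batch_size,
        # how full an average batch is relative to output_batch_size
        "batch_fill_ratio": mean_batch_size / configured_batch_size,
        # mean time spent in one flush, in milliseconds
        "mean_flush_ms": mean_process_time_us / 1000.0,
    }


if __name__ == "__main__":
    summary = summarize_output_metrics(
        flush_rate_per_s=0.41,        # bufferFlushes, 1 minute avg
        mean_batch_size=4,            # batchSize, mean
        configured_batch_size=5000,   # output_batch_size
        mean_process_time_us=64_884,  # processTime, mean
    )
    for name, value in summary.items():
        print(f"{name}: {value:.4f}")
```

If the arithmetic is right, the average batch is a tiny fraction of the configured batch size, which may suggest that flushes here are triggered by the flush interval rather than by full batches. That is an interpretation, not something the metrics state directly.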
Custom Mapping - Prevent 32kb ES Limit Error
curl -X PUT 'http://localhost:9200/_template/graylog-custom-mapping?pretty' -d '
{
  "template": "graylog_*",
  "mappings": {
    "message": {
      "properties": {
        "requestContent": {
          "type": "string",
          "index": "no",
          "doc_values": false
        }
      }
    }
  }
}'
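If more fields than requestContent ever need the same treatment, the template body could be generated instead of written by hand. This is only a sketch: build_template is a made-up helper and responseContent is a hypothetical second field. It reuses the same "string" / "index": "no" settings as the template above, which assumes an Elasticsearch version that still accepts that mapping syntax.

```python
import json


def build_template(fields, index_pattern="graylog_*"):
    """Build an index template body that stores the given fields without indexing them."""
    properties = {
        name: {"type": "string", "index": "no", "doc_values": False}
        for name in fields
    }
    return {
        "template": index_pattern,
        "mappings": {"message": {"properties": properties}},
    }


if __name__ == "__main__":
    # responseContent is a hypothetical extra field, used for illustration only.
    body = build_template(["requestContent", "responseContent"])
    print(json.dumps(body, indent=2))
```

The printed JSON can be passed to the same curl command via -d.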
Plugins
1 - Forked Slack Plugin by aenima4six2 (me).
2 - Aggregate Plugin
3 - HTTP Plugin
Graylog/Docker Setup (note: a 3rd ES server is being provisioned soon to avoid split brain)
version: '3.1'
services:
  mongodb-1:
    image: **mongo:latest based image**
    volumes:
      - ./data/mongodb-1:/data/db
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.db.mongo == mongodb-1
  mongodb-2:
    image: **mongo:latest based image**
    volumes:
      - ./data/mongodb-2:/data/db
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.db.mongo == mongodb-2
  mongodb-arbiter:
    image: **mongo:latest based image**
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.db.mongo == mongodb-arbiter
  # MongoDB Replica Init Container
  mongodb-init:
    image: **mongo:latest based image**
    depends_on:
      - mongodb-1
      - mongodb-2
      - mongodb-arbiter
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
  graylog-master:
    image: **graylog2/server:2.3.1-1 based image**
    environment:
      GRAYLOG_SERVER_JAVA_OPTS: '-Xms4g -Xmx4g -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow'
      GRAYLOG_PASSWORD_SECRET: **redacted**
      GRAYLOG_ROOT_PASSWORD_SHA2: **redacted**
      GRAYLOG_WEB_LISTEN_URI: http://0.0.0.0:9000/
      GRAYLOG_REST_LISTEN_URI: http://0.0.0.0:9000/api/
      GRAYLOG_WEB_ENDPOINT_URI: http://graylog-master:9000/api/
      GRAYLOG_REST_TRANSPORT_URI: http://graylog-master:9000/api/
      GRAYLOG_REST_ENABLE_TLS: 'false'
      GRAYLOG_WEB_ENABLE_TLS: 'false'
      GRAYLOG_MONGODB_URI: mongodb://mongodb-1:27017,mongodb-2:27017/graylog?replicaSet=graylog
      GRAYLOG_ELASTICSEARCH_SHARDS: 6
      GRAYLOG_ELASTICSEARCH_REPLICAS: 1
      GRAYLOG_ELASTICSEARCH_HOSTS: 'http://**REDACTED**-1:9200,http://**REDACTED**-2:9200'
      GRAYLOG_IS_MASTER: 'true'
      GRAYLOG_WEB_ENABLE: 'true'
    volumes:
      - ./data/master/journal:/usr/share/graylog/data/journal
    ports:
      - "12201:12201/udp"
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.app.graylog == graylog-master
    depends_on:
      - mongodb-1
      - mongodb-2
      - mongodb-arbiter
  graylog-slave:
    image: **graylog2/server:2.3.1-1 based image**
    entrypoint: /wait-for-it.sh graylog-master:9000 -t 60 -- /docker-entrypoint.sh graylog
    environment:
      GRAYLOG_SERVER_JAVA_OPTS: '-Xms4g -Xmx4g -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow'
      GRAYLOG_PASSWORD_SECRET: **redacted**
      GRAYLOG_ROOT_PASSWORD_SHA2: **redacted**
      GRAYLOG_WEB_LISTEN_URI: http://0.0.0.0:9000/
      GRAYLOG_REST_LISTEN_URI: http://0.0.0.0:9000/api/
      GRAYLOG_WEB_ENDPOINT_URI: http://graylog-slave:9000/api/
      GRAYLOG_REST_TRANSPORT_URI: http://graylog-slave:9000/api/
      GRAYLOG_REST_ENABLE_TLS: 'false'
      GRAYLOG_WEB_ENABLE_TLS: 'false'
      GRAYLOG_MONGODB_URI: mongodb://mongodb-1:27017,mongodb-2:27017/graylog?replicaSet=graylog
      GRAYLOG_ELASTICSEARCH_SHARDS: 6
      GRAYLOG_ELASTICSEARCH_REPLICAS: 1
      GRAYLOG_ELASTICSEARCH_HOSTS: 'http://**REDACTED**-1:9200,http://**REDACTED**-2:9200'
      GRAYLOG_IS_MASTER: 'false'
      GRAYLOG_WEB_ENABLE: 'true'
    volumes:
      - ./data/slave/journal:/usr/share/graylog/data/journal
    ports:
      - "12202:12201/udp"
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.app.graylog == graylog-slave
    depends_on:
      - mongodb-1
      - mongodb-2
      - mongodb-arbiter
  nginx:
    image: **nginx:latest based image**
    deploy:
      mode: replicated
      replicas: 2
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
    depends_on:
      - graylog-slave
      - graylog-master
    ports:
      - "12200:12200/tcp"
      - "9000:9000"
Graylog Config - Comments Removed
node_id_file = /usr/share/graylog/data/config/node-id
plugin_dir = /usr/share/graylog/plugin
rest_listen_uri = http://0.0.0.0:9000/api/
rest_enable_cors = true
web_listen_uri = http://0.0.0.0:9000/
web_enable_cors = true
elasticsearch_hosts = http://elasticsearch:9200
elasticsearch_compression_enabled = false
allow_leading_wildcard_searches = true
allow_highlighting = false
output_batch_size = 5000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 10
outputbuffer_processors = 10
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /usr/share/graylog/data/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://mongo/graylog
mongodb_max_connections = 100
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_loader_enabled = true
content_packs_dir = /usr/share/graylog/data/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32
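As a sanity check on the large-message theory, here is some rough sizing arithmetic. It assumes the worst case of every in-flight message being about 10 MB, treats the whole 4 GB heap (from -Xmx4g) as available for messages, and ignores per-message overhead, so it is an optimistic upper bound and not a measurement. The function names are made up.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def max_messages_in_heap(heap_bytes: int, message_bytes: int) -> int:
    """Upper bound on how many messages of the given size fit in the heap at once."""
    return heap_bytes // message_bytes


def ring_fill_fraction(heap_bytes: int, message_bytes: int, ring_size: int) -> float:
    """Fraction of a ring buffer that could be occupied before the heap is exhausted."""
    return min(1.0, max_messages_in_heap(heap_bytes, message_bytes) / ring_size)


if __name__ == "__main__":
    heap = 4 * GIB          # -Xmx4g from GRAYLOG_SERVER_JAVA_OPTS
    big_message = 10 * MIB  # worst-case message size mentioned at the top of the post
    ring_size = 65536       # ring_size from the config above
    print(max_messages_in_heap(heap, big_message))
    print(f"{ring_fill_fraction(heap, big_message, ring_size):.4%}")
```

Under these assumptions only a few hundred such messages fit in the heap at once, far fewer than the ring buffers have slots, so a burst of large messages could plausibly put pressure on memory well before the buffers look full.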
Graylog Server Specs - x2
Type: VM (VMWare)
Container Service: Docker CE - Swarm Mode
OS: RHEL 7
CPU: 8 Core
Memory: 8 GB
Disk: 350 Mbps write / 1200 Mbps read (SAN)
Elasticsearch Server Specs - x2
Type: VM (VMWare)
Container Service: None
OS: RHEL 7
CPU: 4 Core
Memory: 15 GB
Disk: 350 Mbps write / 1200 Mbps read (SAN)