Hi there guys, lately our Graylog cluster has been processing about 6 TB per day, and we are running into problems: a lot of journal space is being used and the Graylog nodes are heavily overloaded (load average over 80), with an input of 60k-100k messages per second but an output of only 10k-40k.
When I look at the threads with top -H -d2 I see a bunch of outputbufferprocessor and inputbufferprocessor threads; disk I/O is low.
When I look at MongoDB, the load is 0.14 and I/O is very low as well.
Elasticsearch is the same as MongoDB: low load and very low I/O.
I need help improving the message output rate... anyone have ideas?
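For anyone who wants to look at the same numbers: the Graylog REST API exposes journal depth and buffer utilization per node, which is a quick way to confirm whether the backlog sits in front of processing or in front of the Elasticsearch output. Hostname and credentials below are placeholders, not my real ones:

```shell
# Journal status of this node: how many unprocessed messages are waiting on disk
# (admin/password and graylog01 are placeholders)
curl -s -u admin:password http://graylog01:9000/api/system/journal | python3 -m json.tool

# Buffer utilization: an output buffer pinned at 100% while the process buffer
# drains normally points at the Elasticsearch output path as the bottleneck
curl -s -u admin:password http://graylog01:9000/api/system/buffers | python3 -m json.tool
```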
My environment is all RHEL 9.6, with Graylog 6.3.6 and MongoDB 7.0.25:
8 Graylog servers -> 16 vCPU and 16 GB RAM (Graylog only)
3 MongoDB servers -> 6 vCPU and 8 GB RAM (Graylog DB only)
10 Elasticsearch servers -> 20 vCPU and 30 GB RAM (Graylog data only)
Here are my configs:
Graylog:
Server.conf:
is_leader = true
node_id_file = /etc/graylog/server/node-id
password_secret = #######
root_password_sha2 = ########
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 0.0.0.0:9000
stream_aware_field_types=false
elasticsearch_hosts = https://graylog:P4$WD@elastic01.example.com:9200,https://graylog:P4$WD@elastic02.example.com:9200,https://graylog:P4$WD@elastic03.example.com:9200,https://graylog:P4$WD@elastic04.example.com:9200,https://graylog:P4$WD@elastic05.example.com:9200,https://graylog:P4$WD@elastic06.example.com:9200,https://graylog:P4$WD@elastic07.example.com:9200,https://graylog:P4$WD@elastic08.example.com:9200,https://graylog:P4$WD@elastic09.example.com:9200,https://graylog:P4$WD@elastic10.example.com:9200
disabled_retention_strategies = none,close
allow_leading_wildcard_searches = false
allow_highlighting = false
field_value_suggestion_mode = on
lb_recognition_period_seconds = 3
integrations_scripts_dir = /usr/share/graylog-server/scripts
mongodb_uri = mongodb://graylog:P4$WD@mongo01:27017,mongo02:27017,mongo03:27017/graylog?replicaSet=graylogReplicaSetProducao
##################
# Graylog Tuning
##################
# Buffers
processbuffer_processors = 6
#outputbuffer_processor_threads_max_pool_size = 5
outputbuffer_processors = 8
inputbuffer_processors = 2
# Larger internal queues
ring_size = 262144
inputbuffer_ring_size = 262144
# Lower-latency handoff between buffers
processor_wait_strategy = yielding
inputbuffer_wait_strategy = blocking
# Output tuning
output_batch_size = 10mb
output_flush_interval = 1
output_fault_count_threshold = 50
output_fault_penalty_seconds = 1
# Message journal (acts as a safety buffer)
message_journal_enabled = true
message_journal_dir = /opt/graylog/journal
message_journal_max_size = 20gb
message_journal_segment_size = 500mb
message_journal_flush_age = 30s
message_journal_flush_interval = 1000000
# Elasticsearch connections
elasticsearch_max_total_connections = 500
elasticsearch_max_total_connections_per_route = 50
# MongoDB
mongodb_max_connections = 200
mongodb_threads_allowed_to_block_multiplier = 10
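A side note on the buffer section above, in case it is relevant to the load: the three *_processors settings add up to exactly the node's vCPU count, so with processor_wait_strategy = yielding the processor threads alone can keep every core busy before the JVM, input threads, and HTTP get any time. The arithmetic (just an illustration using the numbers from this config and environment) is:

```python
# Processor thread budget implied by the server.conf above (illustration only)
processbuffer_processors = 6
outputbuffer_processors = 8
inputbuffer_processors = 2
vcpus = 16  # per Graylog node, from the environment description

total = processbuffer_processors + outputbuffer_processors + inputbuffer_processors
print(f"processor threads: {total} / {vcpus} vCPUs")          # 16 / 16
print(f"headroom for JVM GC, inputs, HTTP: {vcpus - total}")  # 0
```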
MONGODB:
# =============================
# /etc/mongod.conf (optimized)
# =============================
storage:
  dbPath: /opt/mongodb
  wiredTiger:
    engineConfig:
      cacheSizeGB: 50
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: true
systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true
net:
  bindIp: 0.0.0.0
  port: 27017
  maxIncomingConnections: 65535
processManagement:
  fork: false
security:
  keyFile: "/opt/mongo/mongo.key"
  authorization: enabled
replication:
  replSetName: "rs0"
ELASTIC:
# ======================== Elasticsearch Configuration =========================
# ---------------------------------- Cluster -----------------------------------
cluster.name: elastic-prod-00
# ------------------------------------ Node ------------------------------------
node.name: elastic01.example.com
# Node roles (adjust per node type)
# For data nodes: node.roles: [ data, ingest ]
# For master-eligible nodes: node.roles: [ master ]
# For coordinating-only nodes: node.roles: [ ]
# ----------------------------------- Paths ------------------------------------
path.data: /data
path.logs: /var/log/elasticsearch
# ----------------------------------- Memory -----------------------------------
# CRITICAL: enable memory lock for production (requires system configuration)
bootstrap.memory_lock: true
# ---------------------------------- Network -----------------------------------
network.host: 0.0.0.0
http.port: 9200
# --------------------------------- Discovery ----------------------------------
discovery.seed_hosts: ["elastic01.example.com", "elastic02.example.com", "elastic03.example.com", "elastic04.example.com", "elastic05.example.com", "elastic06.example.com", "elastic07.example.com", "elastic08.example.com", "elastic09.example.com", "elastic10.example.com"]
cluster.initial_master_nodes: ["elastic01.example.com", "elastic10.example.com"]
# ---------------------------------- Various -----------------------------------
action.destructive_requires_name: true
# Circuit breaker settings - optimized for high throughput
indices.breaker.total.use_real_memory: false
indices.breaker.total.limit: 85%
indices.breaker.fielddata.limit: 40%
indices.breaker.request.limit: 60%
# -------------------------------- Thread Pools --------------------------------
# Optimized for 18 vCPU and high write throughput
thread_pool:
  write:
    size: 18
    queue_size: 50000
  search:
    size: 28    # (18 * 3 / 2) + 1 = 28
    queue_size: 5000
  get:
    size: 28
    queue_size: 5000
  analyze:
    size: 1
    queue_size: 16
# -------------------------------- Performance Settings --------------------------------
# Indexing performance
indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb
# Query performance
indices.queries.cache.size: 15%
indices.requests.cache.size: 5%
# Fielddata cache size (the fielddata breaker is set above)
indices.fielddata.cache.size: 30%
# -------------------------------- Cluster Settings --------------------------------
# Shard allocation and recovery settings for high throughput
cluster.routing.allocation.node_concurrent_recoveries: 4
cluster.routing.allocation.node_initial_primaries_recoveries: 6
cluster.routing.allocation.same_shard.host: false
# Shard rebalancing for optimal distribution
cluster.routing.rebalance.enable: all
cluster.routing.allocation.allow_rebalance: indices_all_active
cluster.routing.allocation.cluster_concurrent_rebalance: 4
# Watermark settings for disk usage (adjust based on disk size)
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%
# ---------------------------------- Security ----------------------------------
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.keystore.path: certs/elastic-nodes-prod.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-nodes-prod.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/certElasticSICOOB.p12
xpack.security.authc.realms.file.file1.order: 0
# -------------------------------- Additional Optimizations --------------------------------
# HTTP settings for better client connections
http.max_content_length: 200mb
http.compression: true
http.cors.enabled: false
# Transport settings
transport.tcp.compress: true
# Node attribute for rack awareness (if used)
node.attr.rack: rack1
# Prevent split brain in smaller clusters (legacy zen setting)
discovery.zen.minimum_master_nodes: 2
# Action settings
action.auto_create_index: true
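Since the backlog shows up on Graylog's output side, one more thing worth checking against this cluster is whether Elasticsearch is rejecting bulk writes despite the large write queue configured above; rejections are visible via the _cat API. Host and credentials below are placeholders:

```shell
# Per-node write thread-pool pressure: a growing "rejected" counter means
# Elasticsearch is pushing back on Graylog's bulk indexing requests
curl -sk -u graylog:password "https://elastic01.example.com:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected&s=rejected:desc"

# Quick per-node CPU/heap overview while indexing is running
curl -sk -u graylog:password "https://elastic01.example.com:9200/_cat/nodes?v&h=name,cpu,load_1m,heap.percent"
```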
