Graylog not outputting messages: input 60k/s, output 30k/s

Hi there guys, lately our Graylog cluster has been processing about 6 TB per day, and there are some problems with it. A lot of journal space is being used and the Graylog nodes are very overloaded (load over 80), with an input of 60k-100k messages per second but an output of only 10k-40k.

My Graylog nodes are overloaded with a load of 80+, and when I look at the threads with top -H -d2 I see a bunch of output buffer and input buffer processor threads; disk I/O is low.

When I look at my MongoDB servers, the load is 0.14 and I/O is very low as well.

The same goes for Elasticsearch: low load and very low I/O.

NEED HELP to improve message output… anyone have ideas?

My environment is all RHEL 9.6, with Graylog 6.3.6 and MongoDB 7.0.25:

8 - Graylog servers -> 16 vCPU and 16 GB RAM (only Graylog)
3 - MongoDB servers -> 6 vCPU and 8 GB RAM (only the Graylog DB)
10 - Elasticsearch servers -> 20 vCPU and 30 GB RAM (only Graylog data)

Here are my configs:

Graylog:

Server.conf:
is_leader = true
node_id_file = /etc/graylog/server/node-id
password_secret = #######
root_password_sha2 = ########
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 0.0.0.0:9000
stream_aware_field_types=false
elasticsearch_hosts = https://graylog:P4$WD@elastic01example.com:9200,https://graylog:P4$WD@elastic02example.com:9200,https://graylog:P4$WD@elastic03example.com:9200,https://graylog:P4$WD@elastic04example.com:9200,https://graylog:P4$WD@elastic05example.com:9200,https://graylog:P4$WD@elastic06example.com:9200,https://graylog:P4$WD@elastic07example.com:9200,https://graylog:P4$WD@elastic08example.com:9200,https://graylog:P4$WD@elastic09example.com:9200,https://graylog:P4$WD@elastic10example.com:9200
disabled_retention_strategies = none,close
allow_leading_wildcard_searches = false
allow_highlighting = false
field_value_suggestion_mode = on
lb_recognition_period_seconds = 3
integrations_scripts_dir = /usr/share/graylog-server/scripts
mongodb_uri = mongodb://graylog:P4$WD@mongo01:27017,mongo02:27017,mongo03:27017/graylog?replicaSet=graylogReplicaSetProducao

##################
# Graylog Tuning
##################

# Buffers

processbuffer_processors = 6
#outputbuffer_processor_threads_max_pool_size = 5
outputbuffer_processors = 8
inputbuffer_processors = 2

# Larger internal queues

ring_size = 262144
inputbuffer_ring_size = 262144

# Lower-latency handoff between buffers

processor_wait_strategy = yielding
inputbuffer_wait_strategy = blocking

# Output tuning

output_batch_size = 10mb
output_flush_interval = 1
output_fault_count_threshold = 50
output_fault_penalty_seconds = 1

# Message journal (acts as a safety buffer)

message_journal_enabled = true
message_journal_dir = /opt/graylog/journal
message_journal_max_size = 20gb
message_journal_segment_size = 500mb
message_journal_flush_age = 30s
message_journal_flush_interval = 1000000

# Elasticsearch connections

elasticsearch_max_total_connections = 500
elasticsearch_max_total_connections_per_route = 50

# MongoDB

mongodb_max_connections = 200
mongodb_threads_allowed_to_block_multiplier = 10
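As a quick sanity check on the buffer tuning above, here is a hedged sketch (the idea that processor threads should leave headroom for GC, inputs, and the journal writer is general guidance, not a hard Graylog rule):

```python
# Sketch: compare the configured Graylog buffer processors with the node's
# core count. Values mirror the server.conf above (6 + 8 + 2 on 16 vCPUs).

def total_processors(process: int, output: int, inputbuffer: int) -> int:
    """Sum of the buffer-processor threads Graylog will start."""
    return process + output + inputbuffer

cores = 16  # vCPUs per Graylog node in this setup
used = total_processors(process=6, output=8, inputbuffer=2)
print(f"{used}/{cores} cores claimed by buffer processors")
```

With all 16 cores claimed by processors alone, GC pauses and input threads have to compete for CPU, which would fit the load-80 symptom; keeping the total a couple of cores below the count is a common starting point.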

MONGODB:

# =============================
# /etc/mongod.conf (optimized)
# =============================

storage:
  dbPath: /opt/mongodb
  wiredTiger:
    engineConfig:
      cacheSizeGB: 50
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: true

systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true

net:
  bindIp: 0.0.0.0
  port: 27017
  maxIncomingConnections: 65535

processManagement:
  fork: false

security:
  keyFile: "/opt/mongo/mongo.key"
  authorization: enabled

replication:
  replSetName: "rs0"

ELASTIC:

# ======================== Elasticsearch Configuration =========================

# ---------------------------------- Cluster -----------------------------------

cluster.name: elastic-prod-00

# ------------------------------------ Node ------------------------------------

node.name: elastic01.example.com

# Node roles (adjust per node type)
# For data nodes:            node.roles: [ data, ingest ]
# For master-eligible nodes: node.roles: [ master ]
# For coordinating nodes:    node.roles:

# ----------------------------------- Paths ------------------------------------

path.data: /data
path.logs: /var/log/elasticsearch

# ----------------------------------- Memory -----------------------------------

# CRITICAL: Enable memory lock for production (requires system configuration)

bootstrap.memory_lock: true

# ---------------------------------- Network -----------------------------------

network.host: 0.0.0.0
http.port: 9200

# --------------------------------- Discovery ----------------------------------

discovery.seed_hosts: ["elastic01example.com","elastic02example.com","elastic03example.com","elastic04example.com","elastic05example.com","elastic06example.com","elastic07example.com","elastic08example.com","elastic09example.com","elastic10example.com"]
cluster.initial_master_nodes: ["elastic01example.com","elastic10example.com"]

# ---------------------------------- Various -----------------------------------

action.destructive_requires_name: true

# Circuit breaker settings - optimized for high throughput

indices.breaker.total.use_real_memory: false
indices.breaker.total.limit: 85%
indices.breaker.fielddata.limit: 40%
indices.breaker.request.limit: 60%

# -------------------------------- Thread Pools --------------------------------

# Optimized for 18 vCPU and high write throughput

thread_pool:
  write:
    size: 18
    queue_size: 50000
  search:
    size: 28          # (18 * 3 / 2) + 1 = 28
    queue_size: 5000
  get:
    size: 28
    queue_size: 5000
  analyze:
    size: 1
    queue_size: 16

# -------------------------------- Performance Settings --------------------------------

# Indexing performance

indices.memory.index_buffer_size: 20%
indices.memory.min_index_buffer_size: 96mb

# Query performance

indices.queries.cache.size: 15%
indices.requests.cache.size: 5%

# Fielddata circuit breaker

indices.fielddata.cache.size: 30%

# -------------------------------- Cluster Settings --------------------------------

# Shard allocation and recovery settings for high throughput

cluster.routing.allocation.node_concurrent_recoveries: 4
cluster.routing.allocation.node_initial_primaries_recoveries: 6
cluster.routing.allocation.same_shard.host: false

# Shard rebalancing for optimal distribution

cluster.routing.rebalance.enable: all
cluster.routing.allocation.allow_rebalance: indices_all_active
cluster.routing.allocation.cluster_concurrent_rebalance: 4

# Watermark settings for disk usage (adjust based on disk size)

cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%

# ---------------------------------- Security ----------------------------------

xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.client_authentication: required
xpack.security.transport.ssl.keystore.path: certs/elastic-nodes-prod.p12
xpack.security.transport.ssl.truststore.path: certs/elastic-nodes-prod.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: certs/certElasticSICOOB.p12
xpack.security.authc.realms.file.file1.order: 0

# -------------------------------- Additional Optimizations --------------------------------

# HTTP settings for better client connections

http.max_content_length: 200mb
http.compression: true
http.cors.enabled: false

# Transport settings

transport.tcp.compress: true

# Node attribute for rack awareness (if using)

node.attr.rack: rack1

# Prevent split brain in smaller clusters

discovery.zen.minimum_master_nodes: 2

# Action timeout settings

action.auto_create_index: true

Hello @tadeu.alves,

What do your index set configs look like: shards per index and average index size?

You could try increasing the output batch size and upping the output buffer processors while lowering the process buffer processors, or reversing that if there is a large amount of processing.

What is the average % utilisation in both the process and output buffers on the Graylog nodes?

If you're not already doing so, having a system like Grafana to map metrics could help with tuning.

What do your index set configs look like: shards per index and average index size?
R – I have 6 indices of about 1.5 TB each; each index has 60 shards, each around 30 GB.

You could try increasing the output batch size and upping the output buffer processors while lowering the process buffer processors, or reversing that if there is a large amount of processing.
R – We tried that; it did not change much, if anything at all (but I'm willing to test any configuration you suggest at this point).

What is the average % utilisation in both the process and output buffers on the Graylog nodes?
R – All nodes keep the input and process buffers at 100%; the output buffer is low when it isn't 0.

If you're not already doing so, having a system like Grafana to map metrics could help with tuning.
R – We haven't implemented it yet. Do you have any good tutorials?

Hey @tadeu.alves,

This guide is a very useful start for gathering metrics from the OS and Graylog.

An output buffer running at 100% generally indicates the issue is with Opensearch: messages can't be ingested quickly enough, so the buffer fills. The core issue can be hard to identify without metrics, as any change made would be closer to a guess.

Do you see performance stay consistent across all OS nodes, or are certain OS nodes running "hot"?

Are three of the OS nodes dedicated leaders?

Can the OS nodes write quickly enough to keep up with this kind of load? Is your storage performant enough? The API calls to OS below might give some insight.

GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected
GET _cat/thread_pool/management?v&h=node_name,active,queue,rejected
GET _nodes/stats/thread_pool?human
GET _nodes/stats/http?human     # http.current_open, http.total_opened
GET _nodes/hot_threads          # see if anything is blocking the write path
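If it helps to eyeball those responses, here is a small hedged sketch (not part of the thread's tooling) that parses the plain-text _cat/thread_pool output and flags nodes with rejections or deep queues; the column order assumed matches the h= list requested above:

```python
# Sketch: scan `_cat/thread_pool/write?v&h=node_name,active,queue,rejected`
# output for nodes that are rejecting work or queueing heavily.

def flag_busy_nodes(cat_output: str, queue_limit: int = 100):
    """Return node names with rejections > 0 or queue depth above the limit."""
    flagged = []
    for line in cat_output.strip().splitlines()[1:]:  # skip the ?v header row
        name, active, queue, rejected = line.split()
        if int(rejected) > 0 or int(queue) > queue_limit:
            flagged.append(name)
    return flagged

sample = """node_name active queue rejected
elastic01 4 2 0
elastic02 18 50000 1234
"""
print(flag_busy_nodes(sample))  # ['elastic02']
```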

I will attach all the output from the curl commands, but in summary there were no rejections, and the queues were all under 10 for both write and management.

Here are all the files of the commands that you requested @Wine_Merchant

curl_hot_2.pdf (6.0 MB)
curl_hot.pdf (5.8 MB)
curl_write.pdf (44.0 KB)
curl_stats.pdf (593.3 KB)
curl_human.pdf (5.2 MB)

And thanks for the help; I really appreciate it, because I have no idea what to do anymore.

@tadeu.alves Do you have any output from Graylog to anywhere other than the OS cluster, for example forwarding via syslog output from Graylog to a SIEM?

I’m guessing the Graylog/Opensearch logs don’t give an indication of an issue?

Nope, all logs stay in Graylog; I don't output them anywhere else. For the SIEM logs I've created another appender that sends them in parallel to another place, so Graylog and its Elastic don't even know about it.

The threads appear to be mostly occupied by merges, but at this scale that is expected. What does the below return?

GET _cat/segments
GET _cat/indices?h=index,pri,rep,docs.count,store.size,seg.size
GET _nodes/stats/indices/indexing,merge,refresh,search

Perhaps increasing the GB per shard and reducing the overall number of shards per index could help; for a start, try 50 GB per shard and 40 shards per index (better math could be used here). Also try disabling the index optimisation options within the index config and increasing the Field Type Refresh Interval.
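A rough sketch of the shard math behind that kind of suggestion (the 25-50 GB per-shard target is the sweet spot discussed in this thread; the index size used is illustrative):

```python
import math

def shards_for(index_size_gb: float, target_shard_gb: float = 50) -> int:
    """Primary shards needed to keep each shard near the target size."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

# e.g. a ~1.27 TB daily index at ~50 GB per shard:
print(shards_for(1270))  # 26 shards rather than 60
```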

Screenshot 2025-11-14 at 12.14.56

Are there three dedicated leader nodes?

No, only 1 leader.

I have 7 indexes each with:

APPS - 60 shards with 180 segments and a field refresh of 120s (the most queried app logs for devs to debug)
APPA - 10 shards with 30 segments and field refresh of 120s
APPB - 10 shards with 30 segments and field refresh of 120s
APPC - 10 shards with 30 segments and field refresh of 120s
APPD - 20 shards with 60 segments and field refresh of 120s
APPE - 20 shards with 60 segments and field refresh of 120s
APPF - 10 shards with 30 segments and field refresh of 120s

Each shard is about 22 GB to 46 GB; I was aiming for the sweet spot between 25 GB and 50 GB, with a retention of 5 days (this brings me to a total of 12 TB; every day about 2 TB of logs is processed).

My outgoing traffic is about 6 TB daily.
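For what it's worth, the quoted figures line up if the live (still-writing) index is counted on top of the 5 retained days, which is an assumption on my part:

```python
# Hedged back-of-envelope check of the retention figures above; counting
# the live index alongside the 5 retained days is my assumption.
daily_tb = 2.0        # stated daily ingest
retention_days = 5
on_disk_tb = daily_tb * (retention_days + 1)
print(on_disk_tb)  # 12.0, matching the ~12 TB quoted
```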

curl_indices.pdf (20.8 KB)
curl_segments.pdf (929.9 KB)
curl_nodes.pdf (26.3 KB)

Cherry picking some examples.

Am I right in thinking that the below index is 527GB and consists of 60 shards?

APPS_275 60 0 533484198 527.2gb

Or the below is 103GB and 10 shards?

APPB_270 20 0 161360348 103.8gb

What is the current rotation strategy?

No problem; those are today's logs, which are still running.

I rotate them every day.

APPS_275 is 872GB right now

and

APPB_270 is 110GB

Just an update:

APPS_275 - 1.27TB

and

APPB_270 - 140GB

@tadeu.alves

I can't say this is the core of the issue, but it does seem like the number of shards per index can be lowered. 60 shards for 1.2 TB is overkill. Take the APPE index set: it frequently has under 1 GB of data, yet there are 10 shards.

This over-provisioning creates a large amount of management overhead for Opensearch and leads to lots of unnecessary work.

I would look to consolidate the index sets (perhaps APPE and APPD could be merged) and review the number of shards per index. If it's not already in use, take advantage of the time-size-optimising rotation strategy.

But isn't the idea to have shards sized between 25 and 50 GB?

That's what I'm aiming for with this number of shards…

I tried lots of different configurations and they all end up the same: high input with low output, the input and process buffers at 100%, and the output buffer at 0%. And when I look at my servers:

top - 17:46:46 up 9 days, 19:31, 1 user, load average: 32.77, 27.44, 24.89
Threads: 829 total, 42 running, 787 sleeping, 0 stopped, 0 zombie
%Cpu0 : 58.2 us, 18.8 sy, 0.0 ni, 15.4 id, 0.0 wa, 7.2 hi, 0.5 si, 0.0 st
%Cpu1 : 51.2 us, 17.2 sy, 0.0 ni, 25.1 id, 0.0 wa, 6.4 hi, 0.0 si, 0.0 st
%Cpu2 : 29.0 us, 22.7 sy, 0.0 ni, 40.1 id, 0.0 wa, 8.2 hi, 0.0 si, 0.0 st
%Cpu3 : 38.1 us, 25.9 sy, 0.0 ni, 28.4 id, 0.0 wa, 7.6 hi, 0.0 si, 0.0 st
%Cpu4 : 84.3 us, 9.3 sy, 0.0 ni, 3.9 id, 0.0 wa, 2.5 hi, 0.0 si, 0.0 st
%Cpu5 : 61.1 us, 16.3 sy, 0.0 ni, 13.8 id, 0.0 wa, 6.4 hi, 2.5 si, 0.0 st
%Cpu6 : 86.1 us, 7.2 sy, 0.0 ni, 2.9 id, 0.0 wa, 3.8 hi, 0.0 si, 0.0 st
%Cpu7 : 63.2 us, 21.1 sy, 0.0 ni, 8.8 id, 0.0 wa, 5.4 hi, 1.5 si, 0.0 st
%Cpu8 : 31.0 us, 24.5 sy, 0.0 ni, 35.5 id, 0.0 wa, 9.0 hi, 0.0 si, 0.0 st
%Cpu9 : 89.3 us, 5.9 sy, 0.0 ni, 0.5 id, 0.0 wa, 2.0 hi, 2.4 si, 0.0 st
%Cpu10 : 60.6 us, 15.9 sy, 0.0 ni, 16.8 id, 0.0 wa, 6.7 hi, 0.0 si, 0.0 st
%Cpu11 : 69.8 us, 13.2 sy, 0.0 ni, 9.3 id, 0.0 wa, 5.9 hi, 2.0 si, 0.0 st
%Cpu12 : 83.5 us, 11.2 sy, 0.0 ni, 1.5 id, 0.0 wa, 3.9 hi, 0.0 si, 0.0 st
%Cpu13 : 89.8 us, 7.8 sy, 0.0 ni, 0.5 id, 0.0 wa, 1.9 hi, 0.0 si, 0.0 st
%Cpu14 : 57.1 us, 18.5 sy, 0.0 ni, 17.6 id, 0.0 wa, 6.3 hi, 0.5 si, 0.0 st
%Cpu15 : 83.5 us, 10.7 sy, 0.0 ni, 1.9 id, 0.0 wa, 3.9 hi, 0.0 si, 0.0 st
MiB Mem : 9696.5 total, 213.8 free, 9056.9 used, 831.0 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 639.6 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                                 

591386 graylog 20 0 16.2g 7.2g 23064 R 53.9 76.2 2:49.28 processbufferpr
591383 graylog 20 0 16.2g 7.2g 23064 S 53.4 76.2 2:50.99 processbufferpr
591381 graylog 20 0 16.2g 7.2g 23064 R 51.5 76.2 2:54.17 processbufferpr
591394 graylog 20 0 16.2g 7.2g 23064 R 51.0 76.2 2:52.60 processbufferpr
591390 graylog 20 0 16.2g 7.2g 23064 R 49.5 76.2 2:48.90 processbufferpr
591384 graylog 20 0 16.2g 7.2g 23064 R 48.0 76.2 2:51.05 processbufferpr
591388 graylog 20 0 16.2g 7.2g 23064 S 46.1 76.2 2:46.09 processbufferpr
591396 graylog 20 0 16.2g 7.2g 23064 R 45.6 76.2 2:46.02 processbufferpr
591380 graylog 20 0 16.2g 7.2g 23064 R 45.1 76.2 2:52.09 processbufferpr
591382 graylog 20 0 16.2g 7.2g 23064 S 39.7 76.2 2:47.55 processbufferpr
591393 graylog 20 0 16.2g 7.2g 23064 R 36.8 76.2 2:48.71 processbufferpr
591379 graylog 20 0 16.2g 7.2g 23064 R 35.3 76.2 2:49.89 processbufferpr
591398 graylog 20 0 16.2g 7.2g 23064 R 34.8 76.2 2:43.20 processbufferpr
591391 graylog 20 0 16.2g 7.2g 23064 S 34.3 76.2 2:48.81 processbufferpr
591397 graylog 20 0 16.2g 7.2g 23064 R 33.8 76.2 2:43.67 processbufferpr
591385 graylog 20 0 16.2g 7.2g 23064 R 31.9 76.2 2:46.14 processbufferpr
591389 graylog 20 0 16.2g 7.2g 23064 R 31.9 76.2 2:46.76 processbufferpr
591395 graylog 20 0 16.2g 7.2g 23064 R 29.9 76.2 2:50.30 processbufferpr
591387 graylog 20 0 16.2g 7.2g 23064 S 29.4 76.2 2:39.49 processbufferpr
591392 graylog 20 0 16.2g 7.2g 23064 R 29.4 76.2 2:49.85 processbufferpr
591347 graylog 20 0 16.2g 7.2g 23064 R 24.5 76.2 1:32.70 outputbufferpro
591346 graylog 20 0 16.2g 7.2g 23064 R 24.0 76.2 1:30.78 outputbufferpro

But if I look at my Elastic servers:

top - 17:48:35 up 145 days, 23:07, 1 user, load average: 3.29, 4.11, 3.89
Threads: 692 total, 3 running, 689 sleeping, 0 stopped, 0 zombie
%Cpu0 : 6.9 us, 1.0 sy, 0.0 ni, 91.1 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu1 : 13.8 us, 1.0 sy, 0.0 ni, 84.2 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 28.6 us, 0.5 sy, 0.0 ni, 70.4 id, 0.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 7.5 us, 0.0 sy, 0.0 ni, 92.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 7.0 us, 0.5 sy, 0.0 ni, 91.5 id, 1.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.5 us, 0.0 sy, 0.0 ni, 99.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 11.4 us, 0.0 sy, 0.0 ni, 87.1 id, 1.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 1.5 us, 0.0 sy, 0.0 ni, 98.5 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 7.0 us, 1.5 sy, 0.0 ni, 90.0 id, 1.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 9.1 us, 0.0 sy, 0.0 ni, 88.4 id, 2.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 8.6 us, 1.0 sy, 0.0 ni, 88.4 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 21.7 us, 3.0 sy, 0.0 ni, 67.7 id, 6.6 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu12 : 15.0 us, 0.5 sy, 0.0 ni, 82.5 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 18.5 us, 1.0 sy, 0.0 ni, 80.0 id, 0.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 16.9 us, 1.0 sy, 0.0 ni, 72.6 id, 1.0 wa, 0.0 hi, 8.5 si, 0.0 st
%Cpu15 : 1.0 us, 0.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu16 : 37.8 us, 1.5 sy, 0.0 ni, 60.2 id, 0.5 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu17 : 19.3 us, 1.0 sy, 0.0 ni, 77.7 id, 2.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem: 36859636 total, 36589292 used, 270344 free, 54280 buffers
KiB Swap: 6815740 total, 0 used, 6815740 free. 7231724 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
94137 elastic+ 20 0 0.419t 0.028t 2.274g D 37.62 80.41 5:20.02 elasticsearch[e
94132 elastic+ 20 0 0.419t 0.028t 2.274g S 30.20 80.41 5:31.22 elasticsearch[e
108708 elastic+ 20 0 0.419t 0.028t 2.274g S 28.71 80.41 79:36.39 elasticsearch[e
108719 elastic+ 20 0 0.419t 0.028t 2.274g S 21.78 80.41 71:54.82 elasticsearch[e
108709 elastic+ 20 0 0.419t 0.028t 2.274g S 18.32 80.41 12:59.41 elasticsearch[e
108777 elastic+ 20 0 0.419t 0.028t 2.274g S 6.931 80.41 307:52.86 elasticsearch[e
108721 elastic+ 20 0 0.419t 0.028t 2.274g S 6.436 80.41 89:14.54 elasticsearch[e
108775 elastic+ 20 0 0.419t 0.028t 2.274g R 6.436 80.41 308:00.53 elasticsearch[e
108779 elastic+ 20 0 0.419t 0.028t 2.274g S 6.436 80.41 308:00.33 elasticsearch[e
108782 elastic+ 20 0 0.419t 0.028t 2.274g R 6.436 80.41 307:32.55 elasticsearch[e
108773 elastic+ 20 0 0.419t 0.028t 2.274g S 5.941 80.41 308:07.92 elasticsearch[e
108785 elastic+ 20 0 0.419t 0.028t 2.274g S 5.941 80.41 307:59.32 elasticsearch[e
108774 elastic+ 20 0 0.419t 0.028t 2.274g S 5.446 80.41 307:56.68 elasticsearch[e
108776 elastic+ 20 0 0.419t 0.028t 2.274g D 5.446 80.41 307:57.20 elasticsearch[e
108778 elastic+ 20 0 0.419t 0.028t 2.274g S 5.446 80.41 307:56.91 elasticsearch[e
108783 elastic+ 20 0 0.419t 0.028t 2.274g D 5.446 80.41 307:48.03 elasticsearch[e
108786 elastic+ 20 0 0.419t 0.028t 2.274g S 5.446 80.41 307:46.03 elasticsearch[e
108781 elastic+ 20 0 0.419t 0.028t 2.274g D 4.950 80.41 307:43.19 elasticsearch[e
108784 elastic+ 20 0 0.419t 0.028t 2.274g D 4.950 80.41 308:01.30 elasticsearch[e
108780 elastic+ 20 0 0.419t 0.028t 2.274g S 4.455 80.41 308:17.76 elasticsearch[e
108713 elastic+ 20 0 0.419t 0.028t 2.274g S 1.980 80.41 73:42.09 elasticsearch[e
1496 root 20 0 0 0 0 D 1.485 0.000 2243:56 jbd2/dm-3-8
108682 elastic+ 20 0 0.419t 0.028t 2.274g S 1.485 80.41 115:42.21 elasticsearch[e

Not to sound like a broken record, but diagnosing bottlenecks will require a more holistic view of the cluster and can be tricky; if you really need the problem solved, I would dedicate some time to setting up Grafana.

As for the shard sizing, I'm not sure what is being achieved by having 10 shards for a single 1 GB index. It would be beneficial to review the shards per index so they are better optimised and there is less merge activity within Opensearch/Elasticsearch.

If you've reached a throughput where the output and process buffers are at 100%, then I would imagine it's still Opensearch/Elasticsearch that is the issue, but we are missing just what that is.

Something else to consider would be the underlying hardware: are the hard drives being used for the journal and the Elastic servers able to handle the throughput? Does iostat -xm 1 on the Graylog and ES servers show anything out of the ordinary?
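To make scanning that output less tedious, a hedged sketch (it assumes the usual extended-stats layout, with Device as the first column and %util as the last):

```python
# Sketch: pull the %util column out of one `iostat -x` sample so saturated
# devices stand out. Column positions are an assumption about the layout.

def device_util(iostat_block: str) -> dict:
    """Map device name -> %util for each row after a 'Device' header."""
    util, in_table = {}, False
    for line in iostat_block.splitlines():
        cols = line.split()
        if not cols:
            in_table = False        # blank line ends the device table
            continue
        if cols[0] == "Device":
            in_table = True         # header row starts the device table
            continue
        if in_table:
            util[cols[0]] = float(cols[-1])
    return util

sample = """Device r/s w/s rMB/s wMB/s %util
sda 10.0 250.0 0.5 40.0 97.3
dm-0 0.0 1.0 0.0 0.1 0.2
"""
busy = {dev: u for dev, u in device_util(sample).items() if u > 80}
print(busy)  # {'sda': 97.3}
```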

Sorry for taking so long to answer.

Do you have any tutorials on how to set up this Grafana monitoring? Because of the end-of-year rush, we're short-staffed at my workplace at the moment.

And iostat -xm 1 did not say much, only that my Graylog servers have idle disks and high CPU.

And my Elastic is even "worse": idle disks and idle CPU.

Hey @tadeu.alves,

No worries, I very much understand.

You can use this guide, which is a great starting point.

I hope you've taken my earlier points on shard/index sizing into consideration.

Hi there,

Looking around further, I found that the problem is my MongoDB (who would have guessed).

The server is overloaded with lots of messages containing the text:

msg":“Client’s executor exceeded time limit”,“attr”:{“elapsedMicros”:XX,“limitMicros”:30}}

I have a cluster of 3 MongoDB nodes, each with 16 vCPU and 16 GB.

cat /etc/mongod.conf

# =============================
# /etc/mongod.conf (optimized)
# =============================

storage:
  dbPath: /opt/mongodb
  wiredTiger:
    engineConfig:
      cacheSizeGB: 50            # ~50% of total RAM if dedicated host
    collectionConfig:
      blockCompressor: snappy    # good tradeoff between CPU and speed
    indexConfig:
      prefixCompression: true

systemLog:
  destination: file
  path: /var/log/mongodb/mongod.log
  logAppend: true

net:
  bindIp: 0.0.0.0
  port: 27017
  maxIncomingConnections: 65535

processManagement:
  fork: false

security:
  keyFile: "/etc/ssl/mongo_key/chaveReplicaSet.key"
  authorization: enabled

replication:
  replSetName: "rs0"

My storage is a 50 GB XFS LUN with low I/O but medium network throughput (about 350 MB).

Any ideas on what I can improve in my Mongo?

My Graylog servers are 8 nodes, each with:

mongodb_max_connections = 1000

No other config, only this.
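A quick hedged check of the worst case those settings allow, using the values quoted in this thread:

```python
# Sketch: total MongoDB connections the Graylog fleet could open versus
# what mongod will accept. Values copied from the configs in this thread.
graylog_nodes = 8
mongodb_max_connections = 1000   # per Graylog node, from server.conf
worst_case = graylog_nodes * mongodb_max_connections
mongod_limit = 65535             # net.maxIncomingConnections in mongod.conf
print(worst_case, worst_case < mongod_limit)  # 8000 True
```

So the raw connection cap is nowhere near exhausted; whatever is slowing Mongo down, it doesn't look like a connection-count limit.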

If you have any idea what to improve, @Wine_Merchant, I'm all ears (or in this case, eyes :D).

I'm not familiar with this Mongo error. Are there logs within the Graylog server logs indicating that Mongo is at times unavailable? Are all Mongo nodes within the replica set in good health? An rs.status() on the primary would give an indication.

What makes you think Mongo is the culprit here?