Graylog throws bulk errors from several indexes


We have two Graylog nodes with the specs and settings below. Our outgoing traffic yesterday was around 1 TB.

One node has: 2x14 cores
One node has: 2x12 cores
Both of them have: 64 GB memory

Some of the relevant settings are:

output_batch_size = 200
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 10
outputbuffer_processors = 10

Our MongoDB cluster is located on three of the Elasticsearch nodes.

We have 8 Elasticsearch nodes with the general and bulk thread-pool settings below; all of them use local SSD disks for data operations.

Each node has a 31.3 GB heap, and Elasticsearch uses about 60% of it on average.
Nodes have 2x8 cores and 192 GB RAM.

Cluster health:

"epoch": "1552458774",
"timestamp": "09:32:54",
"cluster": "graylog",
"status": "green",
"node.total": "8",
"node.data": "8",
"shards": "50736",
"pri": "25368",
"relo": "0",
"init": "0",
"unassign": "0",
"pending_tasks": "0",
"max_task_wait_time": "-",
"active_shards_percent": "100.0%"
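To put the shard count above in perspective, here is a rough back-of-the-envelope calculation (using only the numbers quoted in this post: 50,736 shards, 8 data nodes, 31.3 GB heap each). Elastic's commonly cited guidance is to stay well under roughly 20 shards per GB of heap, so this cluster is far above that:

```python
# Rough arithmetic on the cluster-health numbers quoted above.
total_shards = 50736   # "shards" from _cat/health
data_nodes = 8         # Elasticsearch nodes in this cluster
heap_gb = 31.3         # heap size per node

shards_per_node = total_shards / data_nodes
shards_per_gb_heap = shards_per_node / heap_gb

print(f"shards per node:       {shards_per_node:.0f}")
print(f"shards per GB of heap: {shards_per_gb_heap:.0f}")
```

Reducing the number of shards per index (or keeping fewer indices open) lowers per-bulk-request fan-out, which is directly related to how quickly the bulk queues fill up.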

One node's thread pool:

"node_name": "elasticsearch",
"node_id": "id",
"ephemeral_node_id": "id",
"pid": "29484",
"host": "host",
"ip": "ip",
"port": "port",
"name": "bulk",
"type": "fixed",
"active": "0",
"size": "32",
"queue": "0",
"queue_size": "200",
"rejected": "117962",
"largest": "32",
"completed": "1892115601",
"min": "32",
"max": "32",
"keep_alive": null
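For what it's worth, the stats above also let you estimate what fraction of bulk tasks this node has rejected so far (this is just arithmetic on the `rejected` and `completed` counters quoted above, not a Graylog or Elasticsearch API):

```python
# Rejection ratio from the bulk thread-pool stats quoted above.
rejected = 117_962
completed = 1_892_115_601

ratio = rejected / (rejected + completed)
print(f"rejected fraction of bulk tasks: {ratio:.4%}")
```

The fraction is tiny (well under 0.01%), which matters for interpreting the errors: they indicate transient overload spikes rather than a sustained inability to index.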

The problem is that we see errors like this on our Graylog indexer failures page:

{"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.transport.TransportService$7@3e6cdfe3 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@76faa57b[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 1889113473]]"}

What would be the best settings to avoid these errors? Do these errors indicate that we can't write some of our incoming messages to Elasticsearch, or are they just warnings? Can Graylog try to write messages to Elasticsearch again after it gets an error?

I have read some topics saying that increasing the Elasticsearch bulk queue capacity doesn't actually solve the problem.

We're planning to add one more Graylog node, but I don't think that is related to this problem.

Should we decrease the Graylog output batch size? Or should we add two dedicated master nodes without SSD disks, used only for operating the cluster and not for data? I have read that freeing master nodes from data operations is considered a best practice.

Thanks for your help.
Best Regards,

You have two Graylog servers that each open 10 connections to Elasticsearch at the same time. Raise output_batch_size to 1000 or 2000, lower outputbuffer_processors to 5, and set the index refresh_interval to 30 seconds in Elasticsearch.
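In server.conf terms, the suggested changes might look like this (these exact values are the reply's suggestion, not verified tuning for your workload):

```
# Graylog server.conf — larger batches, fewer concurrent output threads
output_batch_size = 1000
outputbuffer_processors = 5
```

The refresh interval is not a Graylog option but an Elasticsearch index setting (`index.refresh_interval`), typically applied through an index template or, in recent Graylog versions, the index set configuration. With fewer output threads sending larger batches, each bulk request carries more messages, so Elasticsearch receives fewer concurrent bulk tasks and the 200-slot bulk queue overflows less often.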
