Graylog throws bulk errors from several indexes

alpykrbl · March 13, 2019, 6:52am

Hi,

We have 2 Graylog nodes with the settings and specs below.
Our outgoing traffic yesterday was around 1 TB.

One node has: 2x14 Core
One node has: 2x12 Core
Both of them have: 64 GB Memory
Some of the settings are:

output_batch_size = 200
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 10
outputbuffer_processors = 10

Our MongoDB cluster is located on 3 of the Elasticsearch nodes.

We have 8 Elasticsearch nodes with the general and bulk settings below, and all of them have local SSD disks for data operations.

Each node has a 31.3 GB heap, of which Elasticsearch uses about 60% on average.
Each node has 2x8 cores and 192 GB RAM.

Cluster health:

{
  "epoch": "1552458774",
  "timestamp": "09:32:54",
  "cluster": "graylog",
  "status": "green",
  "node.total": "8",
  "node.data": "8",
  "shards": "50736",
  "pri": "25368",
  "relo": "0",
  "init": "0",
  "unassign": "0",
  "pending_tasks": "0",
  "max_task_wait_time": "-",
  "active_shards_percent": "100.0%"
}

One node’s thread pool:

{
  "node_name": "elasticsearch",
  "node_id": "id",
  "ephemeral_node_id": "id",
  "pid": "29484",
  "host": "host",
  "ip": "ip",
  "port": "port",
  "name": "bulk",
  "type": "fixed",
  "active": "0",
  "size": "32",
  "queue": "0",
  "queue_size": "200",
  "rejected": "117962",
  "largest": "32",
  "completed": "1892115601",
  "min": "32",
  "max": "32",
  "keep_alive": null
},
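
For a sense of scale, the rejected count can be set against the completed count from the thread pool output above. This is only a rough sketch: it assumes both counters cover the same period since the node started, and the names bulk_pool and rejection_ratio are made up for illustration.

```python
# Counters copied from the bulk thread pool output above.
bulk_pool = {
    "size": 32,
    "queue_size": 200,
    "rejected": 117962,
    "completed": 1892115601,
}


def rejection_ratio(pool):
    """Fraction of bulk tasks that were rejected instead of completed."""
    total = pool["rejected"] + pool["completed"]
    return pool["rejected"] / total if total else 0.0


ratio = rejection_ratio(bulk_pool)
print(f"rejected share of bulk tasks on this node: {ratio:.4%}")
```

A small share does not mean the rejections are harmless. It only shows that the queue fills up in bursts rather than constantly.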

The problem is that we see errors like this on our Graylog indexer failures page:

{"type":"es_rejected_execution_exception","reason":"rejected execution of org.elasticsearch.transport.TransportService$7@3e6cdfe3 on EsThreadPoolExecutor[bulk, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@76faa57b[Running, pool size = 32, active threads = 32, queued tasks = 200, completed tasks = 1889113473]]"}
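
The reason string above already says why the request was rejected: all 32 bulk threads were busy and the queue of 200 was full. A minimal sketch for pulling those numbers out of such a message follows. The regular expression is written against this one message and is an assumption, not a documented format, and parse_rejection is a hypothetical helper name.

```python
import re

# Copy of the "reason" string from the indexer failure above.
reason = (
    "rejected execution of org.elasticsearch.transport.TransportService$7@3e6cdfe3 "
    "on EsThreadPoolExecutor[bulk, queue capacity = 200, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@76faa57b"
    "[Running, pool size = 32, active threads = 32, queued tasks = 200, "
    "completed tasks = 1889113473]]"
)

# Pattern derived from the message above only; other versions may word it differently.
PATTERN = re.compile(
    r"EsThreadPoolExecutor\[(?P<pool>\w+), queue capacity = (?P<capacity>\d+),"
    r".*?pool size = (?P<size>\d+), active threads = (?P<active>\d+),"
    r" queued tasks = (?P<queued>\d+)"
)


def parse_rejection(text):
    """Return pool name and counters from a rejection message, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    info = {k: (v if k == "pool" else int(v)) for k, v in match.groupdict().items()}
    # Saturated means every thread is busy and the queue has no free slot.
    info["saturated"] = (
        info["active"] >= info["size"] and info["queued"] >= info["capacity"]
    )
    return info


info = parse_rejection(reason)
print(info)
```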

What would be the best settings to avoid these errors? Do these errors indicate that some of our incoming messages cannot be written to Elasticsearch, or are they just warnings? Does Graylog retry writing messages to Elasticsearch after it gets such an error?

I have read some topics saying that increasing the Elasticsearch bulk queue capacity does not actually solve the problem.

We are planning to add one more Graylog node, but I think this is not related to the problem.

Should we decrease the Graylog output batch size? Or should we add 2 dedicated master nodes without SSD disks, used only for operating the cluster and not for data operations? I read a suggestion that it is best practice to free master nodes from data operations.

Thanks for your help.
Best Regards,
Alpay

jan · March 13, 2019, 7:24am

You have two Graylog servers that each hold 10 connections to Elasticsearch at the same time. Raise output_batch_size to 1000 or 2000, lower outputbuffer_processors to 5, and set your index refresh interval (index.refresh_interval) to 30 seconds in Elasticsearch.
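
The effect of these changes can be roughed out with some arithmetic. The sketch below assumes that each output buffer processor keeps at most one bulk request in flight, and it uses a made-up workload of 1,000,000 messages because the thread gives no message rate. It also assumes that the refresh rate mentioned above refers to Elasticsearch's index.refresh_interval setting. The function names are illustrative only.

```python
import json
import math

GRAYLOG_NODES = 2  # from the original post


def concurrent_bulk_writers(nodes, outputbuffer_processors):
    """Upper bound on bulk requests in flight, assuming one per output processor."""
    return nodes * outputbuffer_processors


def bulk_requests_needed(messages, output_batch_size):
    """Number of bulk requests needed to index the given number of messages."""
    return math.ceil(messages / output_batch_size)


current_writers = concurrent_bulk_writers(GRAYLOG_NODES, 10)
proposed_writers = concurrent_bulk_writers(GRAYLOG_NODES, 5)

# Hypothetical workload; the real message rate is not stated in the thread.
MESSAGES = 1_000_000
requests_now = bulk_requests_needed(MESSAGES, 200)
requests_proposed = bulk_requests_needed(MESSAGES, 1000)

# Request body for Elasticsearch's update index settings API (PUT /<index>/_settings).
# The index name is left out on purpose; it depends on the index set.
refresh_body = json.dumps({"index": {"refresh_interval": "30s"}})

print(f"concurrent bulk writers: {current_writers} -> {proposed_writers}")
print(f"bulk requests per {MESSAGES} messages: {requests_now} -> {requests_proposed}")
print(f"settings body: {refresh_body}")
```

Fewer and larger bulk requests put fewer tasks into the bulk queue of 200 at any moment, which is the queue that overflows in the error above.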

In addition, you may want to read:

system · March 27, 2019, 7:45am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
