I’m decommissioning a node and moving all of its shards off using cluster.routing.allocation.exclude._ip. While the relocation is running, I see number_of_pending_tasks spike and then drop; while it is high, messages stop getting processed until it falls back down. Is there any way to prevent this? It is causing messages to back up in the journal.
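For context, this is roughly how I applied the exclusion via the cluster settings API (the endpoint and the IP 10.0.0.5 are placeholders for my actual node):

```
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "10.0.0.5"
  }
}'
```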
Shard allocation is very resource-intensive for your ES cluster - if you are already running near the limit, that can be the reason your ES looks unhealthy from Graylog's perspective.
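One thing you could try is throttling the recovery traffic so the relocation takes longer but leaves headroom for indexing. A sketch, assuming a node reachable on localhost:9200 - the values here are just a conservative starting point, not tuned for your cluster:

```
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "20mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 1
  }
}'
```

Using "transient" means the throttle goes away on a full cluster restart; use "persistent" if you want it to stick.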
The same issue occurs when adding nodes. I’ve added 2 new nodes, and since ES wants the number of shards to be roughly equal across all nodes, it starts moving shards to them. While relocation is allowed, I see the Out messages drop to 0 at times, but when I disable rebalancing with "cluster.routing.allocation.cluster_concurrent_rebalance": 0, there is no drop in the Out messages.
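For completeness, this is how I set it (as a transient setting, so it reverts to the default of 2 after a full cluster restart; endpoint is a placeholder):

```
curl -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' -d '{
  "transient": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": 0
  }
}'
```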
This drive to keep the shard count equal on every node seems to create a bottleneck where my existing nodes sit nearly idle and the new nodes do almost all of the ingestion of new messages - basically 100% CPU on both new nodes and 5-10% on all the existing ones.
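For anyone else hitting this, the per-node shard distribution and disk usage can be checked with the cat allocation API, which makes the imbalance easy to see (endpoint again a placeholder):

```
curl -XGET 'http://localhost:9200/_cat/allocation?v'
```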