Datanode remove: what exactly happens?

**1. Describe your incident:**
My setup: 1 Graylog head node + 1 datanode, with shards: 1 and replicas: 0. I’m trying to add another datanode and retire the old one. Is it possible to do this from the Graylog web GUI, without going through the API?

If I add another datanode and remove the old one, would I lose data? Or does Graylog make sure all the data from the old datanode is migrated to the new one before actually removing it?

2. Describe your environment:

  • OS Information: Linux

  • Package Version: 6.1.8

  • Service logs, configurations, and environment variables: shards:1 replicas: 0

3. What steps have you already taken to try and solve the problem?
I searched the forum and googled. Documentation on the datanode remove action is minimal.

4. How can the community help?


This operation will require you to interface with the OpenSearch API. You can do this by pulling a client cert from the Graylog UI under cluster config.

Assuming the second Graylog Data Node has been added to the cluster, you could exclude the older node from shard allocation with the request below.

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "OLD_DATA_NODE_IP"
  }
}
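
For anyone scripting this rather than using a console, here is a minimal Python sketch of the same request using only the standard library. The base URL (including the port) and the certificate file paths are assumptions for illustration; substitute whatever your Data Node exposes and the client cert you downloaded from Graylog.

```python
import json
import ssl
import urllib.request


def exclusion_body(old_node_ip):
    """Body for PUT _cluster/settings that stops shard allocation to old_node_ip.

    Passing None produces JSON null, which clears the transient setting again.
    """
    return {"transient": {"cluster.routing.allocation.exclude._ip": old_node_ip}}


def put_cluster_settings(base_url, body, cert_file, key_file, ca_file):
    """Send PUT _cluster/settings, authenticating with a client certificate.

    base_url and the three file paths are placeholders (assumptions),
    e.g. base_url="https://datanode.example:9200".
    """
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    req = urllib.request.Request(
        f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.loads(resp.read())
```

Calling `exclusion_body(None)` after the old node is gone gives a body that resets the exclusion, so the setting does not linger in the cluster.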

Once that is done you should be able to see the shards drain off of the old node down to zero.

GET _cat/shards?v

Assuming each index consists of a single shard without replicas, this should be a safe operation. Once the node is drained of shards, it can be shut down.
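
If you would rather check the drain programmatically than eyeball the table, a rough sketch: request `GET _cat/shards?format=json` and count the rows whose `ip` is still the old node. The sample rows, index names, and IP addresses below are made up for illustration.

```python
def shards_on_node(cat_shards_rows, node_ip):
    """Return the rows of a `_cat/shards?format=json` response hosted on node_ip."""
    return [row for row in cat_shards_rows if row.get("ip") == node_ip]


def is_drained(cat_shards_rows, node_ip):
    """True once no shard is reported on node_ip any more."""
    return not shards_on_node(cat_shards_rows, node_ip)


# Hypothetical sample response: one shard still on the old node (10.0.0.1),
# one already on the new node (10.0.0.2).
SAMPLE_ROWS = [
    {"index": "graylog_0", "shard": "0", "prirep": "p", "state": "STARTED",
     "ip": "10.0.0.1", "node": "old-datanode"},
    {"index": "graylog_1", "shard": "0", "prirep": "p", "state": "STARTED",
     "ip": "10.0.0.2", "node": "new-datanode"},
]

remaining = shards_on_node(SAMPLE_ROWS, "10.0.0.1")
print(f"{len(remaining)} shard(s) still on the old node")
```

Only shut the old node down once `is_drained` returns `True` for its IP.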

Last time I tried, the API approach didn’t work. I’ll generate a new 3rd party cert and try again.

@Wine_Merchant what’s your guess regarding removing a datanode: does Graylog do it gracefully, or would it lead to data loss in my case? I guess my question is about what goes on under the hood when I click ‘Remove’.

Managed to move all shards to the new datanode via the API by using a new client cert with the all_access role. Thanks @Wine_Merchant. I’m still not able to retire the old node though. There are two complications:

  • It turns out Graylog uses some internal shards that are set to non-zero replicas. In any case, these replicas can be discarded since they have primaries.
  • Before removing the old datanode, I can’t make sure Graylog is in a good state: if the node is disabled, Graylog throws ‘cluster_manager_not_discovered_exception’. It looks like Graylog is working OK, but the new datanode can’t take over the cluster manager role, I think due to a quorum issue.
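
On the first point, one way to discard those replicas is to set `number_of_replicas` to 0 on the affected indices via `PUT <index>/_settings`. A small sketch that picks the indices out of a `_cat/shards?format=json` response and builds the requests; the index names in the sample are hypothetical, and whether Graylog later restores the replica count is something I have not verified.

```python
def indices_with_replicas(cat_shards_rows):
    """Names of indices that have at least one replica shard (prirep == 'r')."""
    return sorted({row["index"] for row in cat_shards_rows if row.get("prirep") == "r"})


def zero_replicas_request(index_name):
    """Path and body for `PUT <index>/_settings` setting number_of_replicas to 0."""
    return f"/{index_name}/_settings", {"index": {"number_of_replicas": 0}}


# Hypothetical rows: one internal index with a replica, one regular index without.
ROWS = [
    {"index": ".example_internal", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": ".example_internal", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
    {"index": "graylog_0", "shard": "0", "prirep": "p", "state": "STARTED"},
]

for name in indices_with_replicas(ROWS):
    path, body = zero_replicas_request(name)
    print("PUT", path, body)
```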

I’ve created a new issue for this:

Hey @Sinan,

Glad you are one step further and nice work on creating another thread.

I’ve commented there.

Now that I think about this again, shouldn’t the Graylog web UI automatically handle routing when I click “Remove” and, if that’s not possible, show an error to the user?

The problem is that removing a node from a two-node cluster will always be an issue, as quorum requires at least two voting nodes to elect a leader.

Makes sense. So if quorum is not an issue, i.e. if I have 3 datanodes, Graylog will automatically handle allocation of shards if I click remove on a datanode?
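
To put numbers on it, here is the arithmetic as a tiny sketch. It assumes a plain majority rule over the voting nodes; how OpenSearch manages its voting configuration in practice may differ in the details.

```python
def majority(voting_nodes):
    """Smallest number of voting nodes that forms a strict majority."""
    return voting_nodes // 2 + 1


def tolerable_failures(voting_nodes):
    """How many voting nodes can be lost while a majority is still reachable."""
    return voting_nodes - majority(voting_nodes)


for n in (1, 2, 3):
    print(f"{n} voting node(s): majority={majority(n)}, can lose {tolerable_failures(n)}")
```

With two voting nodes the majority is two, so losing either one leaves the cluster unable to elect a manager. With three, the majority is still two, so one node can go away.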

Hey @Sinan

I ran a quick test: I spun up a 3-node cluster and created an index with 3 shards, which were placed evenly across all three nodes. Upon removal of node1, the cluster remained green and node1’s shards are now on node3.

It appears that after adding node1 back to the cluster, the shards require a manual rebalance to move them back.
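
One way to do that manual move is the `POST _cluster/reroute` API with a `move` command. A sketch of building the request body; the index name `test_index` and the shard number are placeholders, not what I used in the test.

```python
def move_shard_body(index_name, shard, from_node, to_node):
    """Body for POST _cluster/reroute that moves one shard between two nodes."""
    return {
        "commands": [
            {
                "move": {
                    "index": index_name,
                    "shard": shard,
                    "from_node": from_node,
                    "to_node": to_node,
                }
            }
        ]
    }


# Placeholder values: move shard 0 of a hypothetical index back to node1.
body = move_shard_body("test_index", 0, "node3", "node1")
print(body)
```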

good to know. Thanks.