Replacing a failed node in a DR/HA configuration

If one of the nodes in either of your HA groups fails, you can replace it.

The procedure varies according to whether the node that you are replacing is a primary or a secondary in the DR configuration. In either case, the new node must have an identical configuration to the node that you are replacing; that is, it must have the same hostname, the same IP addresses, and so on.

You might also completely lose the HA group at your main or recovery site, in which case you have to replace the entire HA group.

Procedure

For a replacement node that is a primary in the DR configuration, complete the following steps on the new node:
1. Create an rdqm.ini file that matches the files on the other nodes, and then run the rdqmadm -c command (see Defining the Pacemaker cluster (HA group)).
2. Run the crtmqm -sxs -rr p qmanager command to recreate each DR/HA RDQM (see Create DR/HA RDQMs).

For a replacement node that is a secondary in the DR configuration, complete the following steps on the new node:
1. Create an rdqm.ini file that matches the files on the other nodes, and then run the rdqmadm -c command (see Defining the Pacemaker cluster (HA group)).
2. Run the crtmqm -sx -rr s qmanager command to recreate each DR/HA RDQM (see Create DR/HA RDQMs).
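
The two per-node procedures above differ only in the crtmqm options used for each DR role. As a minimal sketch, the following shell helpers print the commands for review instead of running them. The function names, the queue manager names QM1 and QM2, and the idea of comparing the new rdqm.ini against a copy taken from a surviving node are illustrative assumptions, not part of the product. The crtmqm options are copied from the steps above; any further options that your original crtmqm commands used are not shown.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands from the steps above; runs nothing.
# All function names here are hypothetical helpers, not MQ commands.

# ini_matches LOCAL REFERENCE
# Succeeds when the local rdqm.ini is byte-identical to a reference copy
# taken from a surviving node (how you obtain that copy is up to you).
ini_matches() {
    cmp -s "$1" "$2"
}

# recreate_cmd ROLE QMGR
# ROLE is the DR role of the replacement node: p (primary) or s (secondary).
# The options printed for each role are the ones listed in the procedure.
recreate_cmd() {
    case "$1" in
        p) printf 'crtmqm -sxs -rr p %s\n' "$2" ;;
        s) printf 'crtmqm -sx -rr s %s\n' "$2" ;;
        *) printf 'unknown DR role: %s\n' "$1" >&2; return 1 ;;
    esac
}

# print_node_plan ROLE QMGR...
# Prints step 1 (rdqmadm -c) once, then step 2 once per DR/HA RDQM.
print_node_plan() {
    role=$1
    shift
    echo 'rdqmadm -c'
    for qm in "$@"; do
        recreate_cmd "$role" "$qm" || return 1
    done
}
```

For example, print_node_plan p QM1 QM2 prints rdqmadm -c followed by one crtmqm -sxs -rr p line for each of QM1 and QM2, which you can review before running the commands yourself on the new node.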

To replace an entire HA group, complete the following steps:
1. If you lose the entire HA group at the DR primary site (that is, the main site), you must follow the steps to perform a managed failover to the DR secondary site to keep running your DR/HA RDQMs (see Operate in a disaster recovery environment). (If you lose an entire HA group at the recovery site, your DR/HA RDQMs continue to run on the main site without your intervention.)
2. Recreate the HA group on your three replacement nodes, as described in Configure HA groups for DR/HA RDQMs.
3. Recreate your DR/HA RDQMs on the new HA group as described in Create DR/HA RDQMs.
4. If required, perform a managed failover from your recovery site back to your main site.
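
Which of the steps above apply depends on which site lost its HA group. The following sketch only prints a checklist and runs no MQ commands. The function name ha_group_plan and its main/recovery argument are hypothetical, and treating steps 1 and 4 as relevant only when the main site was lost is an interpretation of the steps above.

```shell
#!/bin/sh
# ha_group_plan LOST_SITE
# LOST_SITE is "main" (DR primary site) or "recovery" (DR secondary site).
# Prints the applicable steps in order; it does not perform any of them.
ha_group_plan() {
    case "$1" in
        main)
            echo 'Fail over the DR/HA RDQMs to the recovery site'
            echo 'Recreate the HA group on the three replacement nodes'
            echo 'Recreate the DR/HA RDQMs on the new HA group'
            echo 'If required, fail back from the recovery site to the main site'
            ;;
        recovery)
            # The RDQMs keep running on the main site, so no failover step.
            echo 'Recreate the HA group on the three replacement nodes'
            echo 'Recreate the DR/HA RDQMs on the new HA group'
            ;;
        *)
            printf 'unknown site: %s\n' "$1" >&2
            return 1
            ;;
    esac
}
```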
