Replacing a failed node in a disaster recovery configuration
If you lose one of the nodes in a disaster recovery configuration, you can replace the node and restore the disaster recovery configuration by following this procedure.
If a disaster occurs such that the node in the main site is beyond repair, you can replace the failed node while the queue manager runs on the recovery node, and then restore the original disaster recovery configuration. The replacement node must assume the identity of the failed node: the name and IP address must be the same.
You must be logged in either as root or as a user who belongs to the mqm group and has the necessary sudo configuration.
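As a quick pre-flight check, the following sketch tests whether a user is root or a member of the mqm group. The function name can_administer_rdqm is hypothetical and not part of IBM MQ, and the check does not verify the sudo configuration, which depends on how your site has set it up.

```shell
#!/bin/sh
# Hypothetical helper (not an IBM MQ command): succeeds if the given
# uid is 0 (root) or the space-separated group list contains "mqm".
# It does NOT check the sudo configuration.
can_administer_rdqm() {
    uid=$1
    groups=$2
    [ "$uid" -eq 0 ] && return 0
    for g in $groups; do
        [ "$g" = "mqm" ] && return 0
    done
    return 1
}

# Usage with the identity of the current user:
if can_administer_rdqm "$(id -u)" "$(id -nG)"; then
    echo "OK: root or member of mqm"
else
    echo "Not root and not in mqm; the rdqmdr commands are likely to fail"
fi
```

Passing the uid and group list as arguments keeps the check easy to try against other accounts without switching users.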
Procedure
Following the loss of the queue manager on the main site, take the following steps:
- On the recovery node, run the following command to make the secondary queue manager assume the primary role:
  rdqmdr -m QMname -p
  Where QMname is the name of the queue manager.
- Retrieve the command that you will need to run on the replacement primary node to reconfigure disaster recovery:
  rdqmdr -m QMname -d
  Copy the output of this command.
- Run the following command to start the queue manager:
  strmqm QMname
- Ensure that the applications reconnect to the queue manager on the recovery node. Provided that you have defined your channels with a list of alternative connection names that specifies your primary and secondary queue managers, the applications automatically connect to the new primary queue manager.
- Replace the failed node on your main site and configure it to have the same name and IP address that were used for disaster recovery on the original node. Then configure disaster recovery by running the crtmqm command that you copied in step 2. You now have a secondary instance of the queue manager, and the primary instance synchronizes its data with the secondary instance.
- End the current primary instance.
- After the synchronization has completed, make the primary instance that is running on the recovery node into the secondary once more:
  rdqmdr -m QMname -s
- On the replacement primary node, make the secondary instance of the queue manager into the primary instance:
  rdqmdr -m QMname -p
- On the replacement primary node, start the queue manager:
  strmqm QMname
You have now restored the configuration as it was before the failure at your main site.
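The procedure can be summarized as a dry-run checklist. The sketch below only prints the commands, grouped by the node on which they run, and executes nothing. The function name dr_replace_plan and the queue manager name QM1 are hypothetical. The procedure does not name the command for ending the primary instance, so endmqm is an assumption here.

```shell
#!/bin/sh
# Hypothetical helper: prints the command sequence for a given queue
# manager name. It only echoes text; no MQ command is executed.
dr_replace_plan() {
    qm=$1
    echo "# Recovery node: take over and capture the DR configuration command"
    echo "rdqmdr -m $qm -p"
    echo "rdqmdr -m $qm -d"
    echo "strmqm $qm"
    echo "# Replacement node: run the crtmqm command printed by 'rdqmdr -d'"
    echo "# Recovery node: end the current primary instance (endmqm is an assumption)"
    echo "endmqm $qm"
    echo "# Recovery node, after synchronization has completed: become secondary"
    echo "rdqmdr -m $qm -s"
    echo "# Replacement node: become primary and start the queue manager"
    echo "rdqmdr -m $qm -p"
    echo "strmqm $qm"
}

# QM1 is a placeholder queue manager name.
dr_replace_plan QM1
```

Printing the plan rather than running it keeps the sketch safe to try on any machine; each command still has to be entered on the correct node by an operator.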