RDQM high availability
RDQM (replicated data queue manager) is a high availability solution that is available on Linux platforms.
An RDQM configuration consists of three servers configured in a high availability (HA) group, each with an instance of the queue manager. One instance is the running queue manager, which synchronously replicates its data to the other two instances. If the server running this queue manager fails, another instance of the queue manager starts and has current data to operate with. The three instances of the queue manager share a floating IP address, so clients only need to be configured with a single IP address. Only one instance of the queue manager can run at any one time, even if the HA group becomes partitioned due to network problems. The server running the queue manager is known as the 'primary'; each of the other two servers is known as a 'secondary'.
Three nodes are used to greatly reduce the possibility of a split-brain situation arising. In a two-node high availability system, split-brain can occur when the connectivity between the two nodes is broken. With no connectivity, both nodes could run the queue manager at the same time, accumulating different data. When connectivity is restored, there are two different versions of the data (a 'split-brain'), and manual intervention is required to decide which data set to keep and which to discard.
RDQM uses a three-node system with quorum to avoid the split-brain situation. Nodes that can communicate with at least one of the other nodes form a quorum. Queue managers can only run on a node that has quorum. The queue manager cannot run on a node that is not connected to at least one other node, so it can never run on two nodes at the same time:

- If a single node fails, the queue manager can run on one of the other two nodes. If two nodes fail, the queue manager cannot run on the remaining node because the node does not have quorum (the remaining node cannot tell whether the other two nodes have failed, or whether they are still running and it has lost connectivity).
- If a single node loses connectivity, the queue manager cannot run on this node because the node does not have quorum. The queue manager can run on one of the remaining two nodes, which do have quorum. If all nodes lose connectivity, the queue manager is unable to run on any of the nodes, because none of the nodes has quorum.
Note: The IBM MQ Console does not support replicated data queue managers. You can use IBM MQ Explorer with replicated data queue managers, but it does not display information specific to the RDQM features.
The group configuration of the three nodes is handled by Pacemaker. The replication between the three nodes is handled by DRBD. (See https://clusterlabs.org/pacemaker/ for information about Pacemaker and https://docs.linbit.com/docs/users-guide-9.0/ for information about DRBD.)
You can back up your replicated data queue managers by using the process described in Backing up queue manager data. Stopping the queue manager and backing it up has no effect on the node monitoring done by the RDQM configuration.
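For illustration, a minimal backup sketch follows, assuming an HA RDQM named QM1 that is currently running on this node, and assuming its data and logs live on the replicated volume mounted under /var/mqm/vols/QM1 (dspmqinf QM1 shows where the queue manager data actually is; check before copying):

  endmqm -w QM1                                       # end the queue manager and wait for it to stop
  tar -czf /tmp/QM1-backup.tar.gz /var/mqm/vols/QM1   # archive the queue manager data and logs (mount point is an assumption)
  strmqm QM1                                          # restart the queue manager when the copy is complete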
The following figure shows a typical deployment with an RDQM running on each of the three nodes in the HA group.
In the next figure, Node3 has failed, the Pacemaker links have been lost, and queue manager QM3 runs on Node2 instead.
Note: When queue managers fail over to another node they retain the state they had at failover. Queue managers that were running are started; queue managers that were stopped remain stopped.

- Requirements for RDQM HA solution
  You must meet a number of requirements before you configure the RDQM high availability (HA) group.
- Defining the Pacemaker cluster (HA group)
  The HA group is a Pacemaker cluster. You define the Pacemaker cluster by editing the /var/mqm/rdqm.ini file and running the rdqmadm command; a sketch follows this list.
- Creating an HA RDQM
  You use the crtmqm command to create a high availability replicated data queue manager (RDQM), as sketched after this list.
- Setting the preferred location for an RDQM
  The preferred location for a replicated data queue manager (RDQM) identifies the node where the RDQM should run if that node is available (see the sketch after this list).
- Creating and deleting a floating IP address
  A floating IP address enables a client to use the same IP address for a replicated data queue manager (RDQM) regardless of which node in the HA group it is running on; an example follows this list.
- Starting, stopping, and displaying the state of an HA RDQM
  You use variants of standard IBM MQ control commands to start, stop, and view the current state of a replicated data queue manager (RDQM), as shown after this list.
- Viewing RDQM and HA group status
  You can view the status of the HA group and of individual replicated data queue managers (RDQMs).
- Changing IP addresses in high availability configurations
  If you change the IP addresses of any of the interfaces in a high availability configuration, high availability operation is no longer available and the queue manager will not run on the node where the addresses were changed.
- Replacing a failed node in a high availability configuration
  If one of the nodes in your HA group fails, you can replace it.
- Resolving a split-brain situation
  Certain failure sequences in an HA group can lead to a split-brain situation being reported.
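The sketches that follow illustrate the commands involved; all host names, IP addresses, interface names, and the queue manager name QM1 are example values, and the exact options should be confirmed in the topics listed above. To define the HA group, a minimal /var/mqm/rdqm.ini might look like the following, assuming a dedicated replication network (with a single interface per node, the Name entries alone should be enough); the same file must be present on all three nodes:

  Node:
    Name=rdqm1.example.com
    HA_Replication=192.168.4.1
  Node:
    Name=rdqm2.example.com
    HA_Replication=192.168.4.2
  Node:
    Name=rdqm3.example.com
    HA_Replication=192.168.4.3

With the file in place on each node, configure the Pacemaker cluster by running the following on one of the nodes as a user with the necessary privileges:

  rdqmadm -c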
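Creating the queue manager is then a sketch along these lines; depending on how the HA group was set up, the secondary instances may be created for you, otherwise they are created explicitly on the other two nodes:

  crtmqm -sx QM1     # on the node that is to run the queue manager first (the primary)
  crtmqm -sxs QM1    # on each of the other two nodes, to create a secondary instance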
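Setting the preferred location is assumed here to use the -p option of rdqmadm, with the target node named by -n:

  rdqmadm -p -m QM1 -n rdqm2.example.com   # run QM1 on rdqm2.example.com whenever that node is available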
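A floating IP address is added and removed with the rdqmint command; the address and interface name below are example values:

  rdqmint -m QM1 -a -f 192.168.27.100 -l eth4   # bind the floating address to QM1 on interface eth4
  rdqmint -m QM1 -d                             # delete the floating address for QM1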
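Day-to-day control uses the standard control commands, with rdqmstatus for the RDQM-specific state (the -n option is assumed here to report the nodes of the HA group):

  strmqm QM1          # start the queue manager on the node that currently owns it
  endmqm -w QM1       # end the queue manager and wait for it to stop
  rdqmstatus          # summary of the RDQMs known to this node
  rdqmstatus -m QM1   # detailed HA status for one queue manager
  rdqmstatus -n       # status of the nodes in the HA group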
Parent topic: High availability configurations
Related information
- Installing RDQM (replicated data queue managers)
- Upgrading RDQM (replicated data queue managers)
- Migrating replicated data queue managers