Configure high availability, recovery and restart

We can make applications highly available by maintaining queue availability if a queue manager fails, and by recovering messages after a server or storage failure.

On z/OS, high availability is built into the platform. Extra servant regions are spawned as needed, to meet increased demand. We can also improve server application availability by using queue sharing groups. See Shared queues and queue sharing groups.

On Multiplatforms, we can improve client application availability by using client reconnection to switch a client automatically between a group of queue managers, or to the new active instance of a multi-instance queue manager after a queue manager failure. Automatic client reconnect is not supported by IBM MQ classes for Java.

A multi-instance queue manager is configured to run as a single queue manager on multiple servers. You deploy server applications to this queue manager. If the server running the active instance fails, execution is automatically switched to a standby instance of the same queue manager on a different server. If you configure server applications to run as queue manager services, they are restarted when a standby instance becomes the active queue manager instance.
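
The idea behind switching a client between a group of queue managers can be sketched in a few lines. The following is only a conceptual model, not the IBM MQ client API: the function connect_with_failover, the exception ConnectionFailed, the stand-in fake_connect, and the host names are all hypothetical. In practice the IBM MQ client performs reconnection for us once it is configured, so no such loop needs to be written.

```python
from typing import Callable, Sequence, Tuple


class ConnectionFailed(Exception):
    """Raised by the hypothetical connect callable when an endpoint is unreachable."""


def connect_with_failover(
    endpoints: Sequence[str],
    connect: Callable[[str], object],
    max_rounds: int = 3,
) -> Tuple[str, object]:
    """Try each endpoint in order, repeating the whole list up to max_rounds times.

    Returns the endpoint that accepted the connection together with the
    object produced by the connect callable.
    """
    last_error = None
    for _ in range(max_rounds):
        for endpoint in endpoints:
            try:
                return endpoint, connect(endpoint)
            except ConnectionFailed as exc:
                last_error = exc  # remember the failure and try the next endpoint
    raise ConnectionFailed(
        f"no endpoint reachable after {max_rounds} rounds"
    ) from last_error


def fake_connect(endpoint: str) -> str:
    """Stand-in for a real client connect call: the first host is treated as down."""
    if endpoint.startswith("hostA"):
        raise ConnectionFailed(endpoint)
    return f"connected to {endpoint}"


if __name__ == "__main__":
    used, conn = connect_with_failover(["hostA(1414)", "hostB(1414)"], fake_connect)
    print(used, "->", conn)
```

The ordered list mirrors the notion of a group of queue managers: the client settles on the first endpoint that answers, and gives up only after every endpoint has failed in every round.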

Another way to increase server application availability on Multiplatforms is to deploy server applications to multiple computers in a queue manager cluster. From IBM WebSphere MQ Version 7.1 onwards, cluster error recovery reruns operations that caused problems until the problems are resolved. See Changes to cluster error recovery on servers other than z/OS. We can also configure IBM MQ for Multiplatforms as part of a platform-specific clustering solution such as:

Microsoft Cluster Server
HA clusters on IBM i
PowerHA for AIX (formerly HACMP on AIX) and other UNIX and Linux clustering solutions

On Linux systems, we can configure replicated data queue managers (RDQMs) to implement high availability or disaster recovery solutions. For high availability, instances of the same queue manager are configured on each node in a group of three Linux servers. One of the three instances is the active instance. Data from the active queue manager is synchronously replicated to the other two instances, so one of them can take over if the active instance fails. For disaster recovery, a queue manager runs on a primary node at one site, with a secondary instance of that queue manager located on a recovery node at a different site. Data is replicated between the primary instance and the secondary instance, and if the primary node is lost, the secondary instance can be made the primary instance and started.
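
To illustrate why a three-node group can survive the loss of one node but not two, here is a deliberately simplified majority-quorum model. It is an assumption-laden sketch, not the mechanism RDQM actually uses: the function choose_active_node, the node names, and the tie-breaking rule (lowest name wins) are hypothetical.

```python
from typing import Dict, Optional


def choose_active_node(nodes_up: Dict[str, bool], preferred: str) -> Optional[str]:
    """Pick the node that should run the active instance, or None without quorum.

    nodes_up maps each node name to whether it is currently reachable.
    A strict majority of nodes must be up before any node may be active.
    """
    up = sorted(name for name, is_up in nodes_up.items() if is_up)
    if 2 * len(up) <= len(nodes_up):
        return None  # no majority: running an instance would risk split brain
    # Keep the preferred node if it is up; otherwise fail over to another node.
    return preferred if preferred in up else up[0]


if __name__ == "__main__":
    print(choose_active_node({"node1": True, "node2": True, "node3": True}, "node1"))
    print(choose_active_node({"node1": False, "node2": True, "node3": True}, "node1"))
    print(choose_active_node({"node1": True, "node2": False, "node3": False}, "node1"))
```

With three nodes, two must be reachable for an instance to run, which is why the last call returns None even though the preferred node itself is still up.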

Another option for a high availability or disaster recovery solution is to deploy a pair of IBM MQ appliances. See High Availability and Disaster Recovery in the IBM MQ Appliance documentation.

A messaging system ensures that messages entered into the system are delivered to their destination. IBM MQ can trace the route of a message as it moves from one queue manager to another using the dspmqrte command. If a system fails, messages can be recovered in various ways depending on the type of failure, and the way a system is configured. IBM MQ maintains recovery logs of the activities of the queue managers that handle the receipt, transmission, and delivery of messages. It uses these logs for three types of recovery:

Restart recovery, when you stop IBM MQ in a planned way.
Failure recovery, when a failure stops IBM MQ.
Media recovery, to restore damaged objects.

In all cases, the recovery restores the queue manager to the state it was in when the queue manager stopped, except that any in-flight transactions are rolled back, removing from the queues any updates that were in-flight at the time the queue manager stopped. Recovery restores all persistent messages; nonpersistent messages might be lost during the process.
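
The recovery rule above (committed work is restored, in-flight transactions are rolled back) can be modelled with a toy log replay. This is a sketch under stated assumptions: the record layout ("put", "commit", "rollback" tuples) and the function recover are hypothetical and do not reflect the real IBM MQ log format. Only persistent messages are assumed to be written to the log, which is why nonpersistent messages do not appear here at all.

```python
from typing import Dict, List, Tuple

# Hypothetical log record shapes:
#   ("put", transaction_id, message)
#   ("commit", transaction_id)
#   ("rollback", transaction_id)
LogRecord = Tuple[str, ...]


def recover(log: List[LogRecord]) -> List[str]:
    """Replay the log and return the queue contents after recovery."""
    pending: Dict[str, List[str]] = {}  # puts not yet committed, per transaction
    queue: List[str] = []
    for record in log:
        kind = record[0]
        if kind == "put":
            _, txid, message = record
            pending.setdefault(txid, []).append(message)
        elif kind == "commit":
            # The transaction completed: its messages become visible on the queue.
            queue.extend(pending.pop(record[1], []))
        elif kind == "rollback":
            pending.pop(record[1], None)
    # Whatever is still pending was in flight when the queue manager stopped,
    # so it is discarded, which is equivalent to rolling it back.
    return queue


if __name__ == "__main__":
    log = [
        ("put", "t1", "order-1"),
        ("commit", "t1"),
        ("put", "t2", "order-2"),  # t2 never committed: in flight at the stop
    ]
    print(recover(log))
```

Replaying the sample log restores order-1 and drops order-2, because transaction t2 had not committed when the log ended.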

CAUTION: We cannot move recovery logs to a different operating system.

Automatic client reconnection
We can make client applications reconnect automatically, without writing any additional code, by configuring a number of components.
Console message monitoring
On IBM MQ for z/OS, the queue manager and the channel initiator issue a number of information messages that should be considered particularly significant. These messages do not in themselves indicate a problem, but they are worth tracking because they can point to a potential issue that might need addressing.
High availability configurations
To operate IBM MQ queue managers in a high availability (HA) configuration, we can set up the queue managers to work either with a high availability manager, such as PowerHA for AIX (formerly HACMP) or the Microsoft Cluster Service (MSCS), or with IBM MQ multi-instance queue managers. On Linux systems, we can also deploy replicated data queue managers (RDQMs), which use a quorum-based group to provide high availability.
Logging: Making sure that messages are not lost
IBM MQ records all significant changes to the persistent data controlled by the queue manager in a recovery log.
Backing up and restoring IBM MQ queue manager data
We can protect queue managers against possible corruption caused by hardware failures by backing up queue managers and queue manager data, by backing up the queue manager configuration only, and by using a backup queue manager.
