High availability for IBM MQ in containers

We have two main choices for high availability with the IBM MQ Advanced certified container: a multi-instance queue manager, which is an active-standby pair that uses a shared, networked file system, and a single resilient queue manager, which offers a simple approach to HA using networked storage.

We should consider message availability and service availability separately. With IBM MQ for Multiplatforms, a message is stored on exactly one queue manager, so if that queue manager becomes unavailable, we temporarily lose access to the messages it holds. To achieve high message availability, we need to be able to recover a queue manager as quickly as possible. We can achieve service availability by having multiple instances of queues for client applications to use, for example by using an IBM MQ uniform cluster.

A queue manager can be thought of in two parts: the data stored on disk, and the running processes that allow access to the data. Any queue manager can be moved to a different Kubernetes Node, as long as it keeps the same data (provided by Kubernetes Persistent Volumes) and is still addressable across the network by client applications. In Kubernetes, a Service is used to provide a consistent network identity.
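As a rough illustration of the consistent network identity described above, the following sketch builds a Kubernetes Service manifest as a Python dictionary and prints it as JSON. The function name, the Service name qm1, and the app selector label are hypothetical and would need to match the labels on the real queue manager Pod. Port 1414 is the conventional IBM MQ listener port, used here as an assumed default.

```python
import json


def queue_manager_service(name: str, port: int = 1414) -> dict:
    """Build a Service manifest that fronts a queue manager Pod.

    The Service keeps the same name and cluster address even when the
    Pod behind it is rescheduled to a different Node.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            # Hypothetical label; it must match the labels on the Pod.
            "selector": {"app": name},
            "ports": [
                {
                    "name": "qmgr",
                    "protocol": "TCP",
                    "port": port,
                    "targetPort": port,
                }
            ],
        },
    }


if __name__ == "__main__":
    # Print the manifest as JSON for inspection.
    print(json.dumps(queue_manager_service("qm1"), indent=2))
```

Client applications would then connect to the Service name rather than to an individual Pod address.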

IBM MQ relies on the availability of the data on the persistent volumes. Therefore, the availability of the storage providing the persistent volumes is critical to queue manager availability, because IBM MQ cannot be more available than the storage it is using. To tolerate an outage of an entire availability zone, we need to use a volume provider that replicates disk writes to another zone.

Multi-instance queue manager

Multi-instance queue managers involve an active and a standby Kubernetes Pod, which run as part of a Kubernetes StatefulSet with exactly two replicas and a set of Kubernetes Persistent Volumes. The queue manager transaction logs and data are held on two persistent volumes, using a shared file system.

Multi-instance queue managers require both the active and the standby Pods to have concurrent access to the persistent volumes. To configure this, we use Kubernetes Persistent Volumes with the access mode set to ReadWriteMany. The volumes must also meet the IBM MQ requirements for shared file systems, because IBM MQ relies on the automatic release of file locks to instigate a queue manager failover. IBM MQ produces a list of tested file systems.
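The ReadWriteMany requirement can be sketched as a PersistentVolumeClaim manifest. This is a minimal example under stated assumptions, not the layout that the certified container itself generates: the function name, the claim name, the storage class name shared-fs, and the 2Gi size are all hypothetical. The storage class would need to be backed by a file system that meets the IBM MQ shared file system requirements.

```python
import json


def shared_volume_claim(name: str, storage_class: str, size: str = "2Gi") -> dict:
    """Build a PersistentVolumeClaim that both Pods can mount at once."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            # ReadWriteMany lets the active and the standby Pod mount
            # the volume concurrently, possibly from different Nodes.
            "accessModes": ["ReadWriteMany"],
            # Hypothetical storage class name supplied by the caller.
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(shared_volume_claim("qm1-data", "shared-fs"), indent=2))
```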

The recovery times for a multi-instance queue manager are controlled by the following factors:

How long it takes after a failure occurs for the shared file system to release the locks originally taken by the active instance.
How long it takes for the standby instance to acquire the locks and then start.
How long it takes for the Kubernetes Pod readiness probe to detect that the container is ready. This is configurable.
How long it takes for IBM MQ clients to reconnect.
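Because these factors occur one after another, a first estimate of the recovery time is simply their sum. The sketch below models that arithmetic. The class name, field names, and the numbers in the usage example are invented placeholders, not measurements; real values depend on the file system, the probe settings, and the client reconnect configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MultiInstanceRecovery:
    """Hypothetical recovery budget, one field per factor, in seconds."""

    lock_release_s: int      # file system releases the failed instance's locks
    standby_start_s: int     # standby acquires the locks and starts
    readiness_detect_s: int  # readiness probe reports the container ready
    client_reconnect_s: int  # IBM MQ clients reconnect

    def total_s(self) -> int:
        """Return the summed estimate, assuming the phases do not overlap."""
        return (
            self.lock_release_s
            + self.standby_start_s
            + self.readiness_detect_s
            + self.client_reconnect_s
        )


if __name__ == "__main__":
    # Placeholder values chosen only to show the arithmetic.
    budget = MultiInstanceRecovery(
        lock_release_s=20,
        standby_start_s=5,
        readiness_detect_s=5,
        client_reconnect_s=10,
    )
    print(f"Estimated recovery time: {budget.total_s()} s")
```

Laying the factors out this way makes it easier to see which one dominates. In many setups that is likely to be the lock release time, which is controlled by the storage provider rather than by IBM MQ or Kubernetes.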

Single resilient queue manager

A single resilient queue manager is a single instance of a queue manager running in a single Kubernetes Pod, where Kubernetes monitors the queue manager and replaces the Pod as necessary.

The IBM MQ requirements for shared file systems also apply when using a single resilient queue manager (except for lease-based locking), but we do not need to use a shared file system. We can use block storage with a suitable file system on top, for example xfs or ext4.

The recovery times for a single resilient queue manager are controlled by the following factors:

How long it takes for the liveness probe to run, and how many failures it tolerates. This is configurable.
How long the Kubernetes Scheduler takes to re-schedule the failed Pod to a new Node.
How long it takes to download the container image to the new Node. If we use an imagePullPolicy value of IfNotPresent, then the image might already be available on that Node.
How long it takes for the new queue manager instance to start.
How long it takes for the Kubernetes Pod readiness probe to detect that the container is ready. This is configurable.
How long it takes for IBM MQ clients to reconnect.
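The first factor, liveness detection, can be estimated from the probe timing fields. The sketch below uses the standard Kubernetes probe fields periodSeconds, timeoutSeconds, and failureThreshold, with the Kubernetes defaults (10, 1, and 3) as default values. The class name, the function name, and the bound itself are assumptions for illustration: the formula is a rough upper bound that ignores scheduling jitter and any initial delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTiming:
    """Timing fields of a Kubernetes probe (defaults match Kubernetes)."""

    period_seconds: int = 10
    timeout_seconds: int = 1
    failure_threshold: int = 3

    def to_manifest(self) -> dict:
        """Return the timing part of a probe spec.

        The probe handler (exec, httpGet, or tcpSocket) is omitted here.
        """
        return {
            "periodSeconds": self.period_seconds,
            "timeoutSeconds": self.timeout_seconds,
            "failureThreshold": self.failure_threshold,
        }


def liveness_detection_bound(probe: ProbeTiming) -> int:
    """Approximate worst-case seconds until the container is declared dead.

    Assumes the failure happens just after a successful probe, so it takes
    failure_threshold further probe periods, plus one timeout for the last
    probe, before Kubernetes acts.
    """
    return probe.period_seconds * probe.failure_threshold + probe.timeout_seconds


if __name__ == "__main__":
    default_probe = ProbeTiming()
    print(default_probe.to_manifest())
    print(f"Detection bound: {liveness_detection_bound(default_probe)} s")
```

Lowering periodSeconds or failureThreshold shortens detection, at the cost of a higher chance that a briefly busy queue manager is restarted unnecessarily.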

Important:
Although the single resilient queue manager pattern offers some benefits, we need to understand whether we can reach our availability goals given the limitations around Node failures.
In Kubernetes, a failing Pod is typically recovered quickly, but the failure of an entire Node is handled differently. If the Kubernetes Master Node loses contact with a worker node, it cannot determine whether the node has failed or has simply lost network connectivity. Therefore, Kubernetes takes no action in this case until one of the following events occurs:

The node recovers to a state where the Kubernetes Master Node can communicate with it.
An administrative action is taken to explicitly delete the Pod on the Kubernetes Master Node. This does not necessarily stop the Pod from running, but just deletes it from the Kubernetes store. This administrative action must therefore be taken very carefully.
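Before taking any administrative action, it can help to see which nodes Kubernetes has lost contact with. The sketch below is an illustrative helper, not part of IBM MQ or Kubernetes: it inspects a node list in the JSON shape produced by `kubectl get nodes -o json` and reports the nodes whose Ready condition has the status Unknown, which is the status Kubernetes records when it stops hearing from a node. The function names and the sample data are hypothetical.

```python
from typing import List


def node_ready_status(node: dict) -> str:
    """Return the status of the node's Ready condition.

    The value is "True", "False", or "Unknown"; a node with no Ready
    condition at all is treated as "Unknown".
    """
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status", "Unknown")
    return "Unknown"


def unreachable_nodes(node_list: dict) -> List[str]:
    """List the names of nodes whose Ready condition is Unknown."""
    return [
        node["metadata"]["name"]
        for node in node_list.get("items", [])
        if node_ready_status(node) == "Unknown"
    ]


if __name__ == "__main__":
    # Hypothetical sample in the shape of `kubectl get nodes -o json`.
    sample = {
        "items": [
            {
                "metadata": {"name": "worker-1"},
                "status": {"conditions": [{"type": "Ready", "status": "True"}]},
            },
            {
                "metadata": {"name": "worker-2"},
                "status": {"conditions": [{"type": "Ready", "status": "Unknown"}]},
            },
        ]
    }
    print(unreachable_nodes(sample))
```

A node reported here may still be running its Pods, so this check only identifies candidates for investigation. It does not show that deleting a Pod is safe.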

Related information

High availability configurations

Last updated: 2020-10-04