Cluster-administration considerations

 


Let us now look at some considerations affecting the system administrator.

 

Maintaining a queue manager

From time to time, you might need to perform maintenance on a queue manager that is part of a cluster. For example, you might need to take backups of the data in its queues, or apply fixes to the software. If the queue manager hosts any cluster queues, its activities must be suspended before the maintenance starts. When the maintenance is complete, its activities can be resumed.

To suspend a queue manager, issue the SUSPEND QMGR command, for example:

SUSPEND QMGR CLUSTER(SALES)

This sends a notification to the queue managers in the cluster SALES, advising them that this queue manager has been suspended. The purpose of the SUSPEND QMGR command is only to advise other queue managers to avoid sending messages to this queue manager if possible; it does not mean that the queue manager is disabled. While the queue manager is suspended, the workload management routines avoid sending messages to it, other than messages that have to be handled by that queue manager, such as messages sent by the local queue manager itself. The workload management routines choose the local queue manager whenever possible, even if it is suspended.
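You can check which queue managers in a cluster are currently suspended by displaying the SUSPEND attribute of their cluster queue manager records, for example:

DISPLAY CLUSQMGR(*) CLUSTER(SALES) SUSPEND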

When the maintenance is complete, the queue manager can resume its position in the cluster. To do this, issue the RESUME QMGR command, for example:

RESUME QMGR CLUSTER(SALES)

This notifies the full repositories that the queue manager is available again. The full repository queue managers disseminate this information to the other queue managers that have requested updates to information concerning this queue manager.

You can enforce the suspension of a queue manager by using the FORCE option on the SUSPEND QMGR command, for example:

SUSPEND QMGR CLUSTER(SALES) MODE(FORCE)

This forcibly stops all inbound channels from other queue managers in the cluster. If you do not specify MODE(FORCE), the default MODE(QUIESCE) applies.

 

Refreshing a queue manager

A queue manager can make a fresh start in a cluster. This is unlikely to be necessary in normal circumstances, but you might be asked to do this by your IBM Support Center. You can issue the REFRESH CLUSTER command from a queue manager to remove all cluster queue-manager objects and all cluster queue objects relating to queue managers other than the local one from the local full repository. The command also removes any auto-defined channels that do not have messages on the cluster transmission queue and that are not attached to a full repository queue manager. Effectively, the REFRESH CLUSTER command allows a queue manager to be cold started with respect to its full repository content. (WebSphere MQ ensures that no data is lost from your queues.)
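For example, to refresh the local queue manager's view of the cluster SALES used in the earlier examples, while retaining its knowledge of the queue managers that hold the full repositories, you might issue:

REFRESH CLUSTER(SALES) REPOS(NO)

Specifying REPOS(YES) instead also refreshes the information about the full repository queue managers; it cannot be used on a queue manager that is itself a full repository.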

 

Recovering a queue manager

To recover a queue manager in a cluster, restore the queue manager from a linear log (see the WebSphere MQ System Administration Guide for details).

If you have to restore from a point-in-time backup, issue the REFRESH CLUSTER command on the restored queue manager for all clusters in which the queue manager participates.

There is no need to issue the REFRESH CLUSTER command on any other queue manager.
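For example, if the restored queue manager participates in the clusters SALES and FINANCE (FINANCE being a hypothetical second cluster), you would issue, on the restored queue manager only:

REFRESH CLUSTER(SALES)
REFRESH CLUSTER(FINANCE)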

 

Maintaining the cluster transmission queue

The availability and performance of the cluster transmission queue are essential to the performance of clusters. Make sure that it does not become full, and take care not to accidentally issue an ALTER command to set it either get-disabled or put-disabled. Also make sure that the medium the cluster transmission queue is stored on (for example z/OS page sets) does not become full. For performance reasons, on z/OS set the INDXTYPE of the cluster transmission queue to CORRELID.
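For example, you can check the current depth of the cluster transmission queue against its maximum depth, and on z/OS set the recommended index type, with commands such as:

DISPLAY QLOCAL(SYSTEM.CLUSTER.TRANSMIT.QUEUE) CURDEPTH MAXDEPTH
ALTER QLOCAL(SYSTEM.CLUSTER.TRANSMIT.QUEUE) INDXTYPE(CORRELID)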

 

What happens when a queue manager fails?

If a batch of messages is sent to a particular queue manager and that queue manager becomes unavailable, this is what happens:

  • With the exception of non-persistent messages on a fast channel (which might be lost), the undelivered batch of messages is backed out to the cluster transmission queue on the sending queue manager.

    • If the backed-out batch of messages is not in doubt and the messages are not bound to the particular queue manager, the workload management routine is called. The workload management routine selects a suitable alternative queue manager and the messages are sent there.

    • Messages that have already been delivered to the queue manager, or are in doubt, or have no suitable alternative, wait until the original queue manager becomes available again. (You can check for in-doubt batches as shown in the example after this list.)
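To see whether a channel is holding an in-doubt batch, you can display its status; TO.SALES.QM1 is a hypothetical cluster-sender channel name:

DISPLAY CHSTATUS(TO.SALES.QM1) STATUS INDOUBT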

Restarting the failed queue manager can be automated using Automatic Restart Management (ARM) on z/OS, HACMP on AIX, or any other restart mechanism available on your platform.

 

What happens when a repository fails?

Cluster information is carried to repositories (whether full or partial) on a local queue called SYSTEM.CLUSTER.COMMAND.QUEUE. If this queue fills up, perhaps because the queue manager has stopped working, the cluster-information messages are routed to the dead-letter queue. If you observe from the messages on your queue-manager log or z/OS system console that this is happening, run an application to retrieve the messages from the dead-letter queue and reroute them to their correct destination.
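On platforms other than z/OS, the dead-letter queue handler, runmqdlq, can do this rerouting for you. The following is a minimal sketch of a rules table, assuming the default dead-letter queue name SYSTEM.DEAD.LETTER.QUEUE; it forwards messages that were dead-lettered because the target queue was full back to the destination queue named in their dead-letter header:

INPUTQ(SYSTEM.DEAD.LETTER.QUEUE) INPUTQM(' ') RETRYINT(60) WAIT(YES)
REASON(MQRC_Q_FULL) ACTION(FWD) FWDQ(&DESTQ) HEADER(NO)

You might then run the handler with a command such as runmqdlq SYSTEM.DEAD.LETTER.QUEUE QM1 < dlq.rul, where QM1 and dlq.rul are hypothetical names.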

If errors occur on a repository queue manager, messages tell you what error has occurred and how long the queue manager will wait before trying to restart. On WebSphere MQ for z/OS the SYSTEM.CLUSTER.COMMAND.QUEUE is get-disabled. When you have identified and resolved the error, get-enable the SYSTEM.CLUSTER.COMMAND.QUEUE so that the queue manager can restart successfully.
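For example:

ALTER QLOCAL(SYSTEM.CLUSTER.COMMAND.QUEUE) GET(ENABLED)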

In the unlikely event of a queue manager's repository running out of storage, storage allocation errors appear on your queue-manager log or z/OS system console. If this happens, stop and then restart the queue manager. When the queue manager is restarted, more storage is automatically allocated to hold all the repository information.

 

What happens if I put-disable a cluster queue?

When a cluster queue is put-disabled, this situation is reflected in the full repository of each queue manager that is interested in that queue. The workload management algorithm tries to send messages to destinations that are put-enabled. If there are no put-enabled destinations and no local instance of a queue, an MQOPEN call that specified MQOO_BIND_ON_OPEN returns a return code of MQRC_CLUSTER_PUT_INHIBITED to the application. If MQOO_BIND_NOT_FIXED is specified, or there is a local instance of the queue, an MQOPEN call succeeds but subsequent MQPUT calls fail with return code MQRC_PUT_INHIBITED.
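For example, you can put-disable and later re-enable a local instance of a cluster queue (INVENTQ being a hypothetical queue name) as follows:

ALTER QLOCAL(INVENTQ) PUT(DISABLED)
ALTER QLOCAL(INVENTQ) PUT(ENABLED)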

You can write a user exit program to modify the workload management routines so that messages can be routed to a destination that is put-disabled. If a message arrives at a destination that is put-disabled (because it was in flight when the queue became disabled, or because a workload exit chose the destination explicitly), the workload management routine at the receiving queue manager can choose another appropriate destination if there is one, place the message on the dead-letter queue, or, if there is no dead-letter queue, return the message to the originator.

 

How long do the queue manager repositories retain information?

When a queue manager sends out some information about itself, for example to advertise the creation of a new queue, the full and partial repository queue managers store the information for 30 days. To prevent this information from expiring, queue managers automatically resend all information about themselves after 27 days. If a partial repository sends a new request for information part way through the 30-day lifetime, it sees an expiry time of the remaining period.

When information expires, it is not immediately removed from the repository. Instead, it is held for a grace period of 60 days; if no update is received within the grace period, the information is removed. The grace period allows for the fact that a queue manager might have been temporarily out of service at the expiry date. If a queue manager becomes disconnected from a cluster for more than 90 days, it stops being part of the cluster. However, if it reconnects to the network it becomes part of the cluster again. Full repositories do not use information that has expired to satisfy new requests from other queue managers.

Similarly, when a queue manager sends a request for up-to-date information from a full repository, the request lasts for 30 days. After 27 days, WebSphere MQ checks the request. If it has been referenced during the 27 days, it is remade automatically. If not, it is left to expire and is remade by the queue manager if it is needed again. This prevents a build-up of requests for information from dormant queue managers.

 

Cluster channels

Although using clusters relieves you of the need to define channels (WebSphere MQ defines them for you), the same channel technology used in distributed queuing is used for communication between queue managers in a cluster. To understand cluster channels, you need to be familiar with matters such as:

  • How channels operate

  • How to find their status

  • How to use channel exits

These topics are all discussed in the WebSphere MQ Intercommunication book, and the advice given there is generally applicable to cluster channels. However, you might want to give some special consideration to the following:

  1. When you are defining cluster-sender and cluster-receiver channels, choose a value for HBINT or KAINT that detects a network or queue manager failure in a useful amount of time, but does not burden the network with too many heartbeat or keepalive flows. Bear in mind that choosing a short time, for example less than about 10 seconds, can give false failures if your network sometimes slows down and introduces delays of this length. (A channel definition illustrating this appears after this list.)

  2. Set the BATCHHB value if you want to reduce the window in which a message can be marooned because it is in doubt on a failed channel. This is more likely to occur when message traffic along the channel is sporadic, with long periods between bursts of messages during which a network failure is likely. This situation is sometimes induced artificially when testing failover of cluster queue managers, and might not be relevant on production systems.

  3. If the cluster-sender end of a channel fails and subsequently tries to restart before the heartbeat or keepalive has detected the failure, the restart is rejected if the cluster-receiver end of the channel has remained active. To avoid this, you can arrange for the cluster-receiver channel to be terminated and restarted when a cluster-sender channel attempts to restart.

    On WebSphere MQ for z/OS
    Control this using the ADOPTMCA and ADOPTCHK parameters of CSQ6CHIP. See the WebSphere MQ for z/OS System Setup Guide for more information.

    On platforms other than z/OS
    Control this using the AdoptNewMCA, AdoptNewMCATimeout, and AdoptNewMCACheck attributes in the qm.ini file or the Windows NT Registry. See the WebSphere MQ System Administration Guide for more information. (A sketch of a qm.ini stanza follows this list.)
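As an illustration of points 1 and 2, a cluster-receiver channel definition might set the heartbeat and batch heartbeat intervals explicitly. The channel name, connection name, and interval values shown here are hypothetical; choose values appropriate to your network:

DEFINE CHANNEL(TO.SALES.QM1) CHLTYPE(CLUSRCVR) TRPTYPE(TCP) +
       CONNAME('qm1.example.com(1414)') CLUSTER(SALES) +
       HBINT(60) BATCHHB(10000)

For point 3, on platforms other than z/OS, a Channels stanza in qm.ini might look like the following sketch; the values shown are illustrative only:

Channels:
   AdoptNewMCA=ALL
   AdoptNewMCATimeout=60
   AdoptNewMCACheck=ALL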

 

IBM and WebSphere are trademarks of International Business Machines Corporation in the United States, other countries, or both.