Queue manager clusters troubleshooting

Use the checklist given here, and the advice given in the subtopics, to help you detect and deal with problems when you use queue manager clusters.

Before starting

If your problems relate to publish/subscribe messaging using clusters, rather than to clustering in general, see Routing for publish/subscribe clusters: Notes on behavior.

Procedure

Check that the cluster channels are all paired.
Each cluster sender channel connects to a cluster receiver channel of the same name. If there is no local cluster receiver channel with the same name as the cluster sender channel on the remote queue manager, the cluster sender channel cannot connect.
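The pairing rule can be checked mechanically once the channel names have been collected from each queue manager. The following Python sketch assumes you have already gathered that inventory yourself (for example, from DISPLAY CHANNEL output); the data layout, the function name, and the queue manager and channel names are all hypothetical.

```python
def unpaired_senders(senders, receivers):
    """Return cluster sender channels that have no cluster receiver
    channel of the same name on the queue manager they connect to.

    senders:   list of (local_qmgr, channel_name, remote_qmgr) tuples
    receivers: dict mapping qmgr name -> set of cluster receiver names
    """
    return [
        (local, name, remote)
        for local, name, remote in senders
        if name not in receivers.get(remote, set())
    ]


# Hypothetical inventory: QM3 has a sender whose name matches no receiver on QM1.
senders = [
    ("QM1", "CLUSTER1.QM2", "QM2"),
    ("QM2", "CLUSTER1.QM1", "QM1"),
    ("QM3", "CLUSTER1.QMX", "QM1"),
]
receivers = {
    "QM1": {"CLUSTER1.QM1"},
    "QM2": {"CLUSTER1.QM2"},
    "QM3": {"CLUSTER1.QM3"},
}

for local, name, remote in unpaired_senders(senders, receivers):
    print(f"{local}: sender {name} has no matching receiver on {remote}")
```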
Check that your channels are running. No channels should be in RETRYING state permanently. Show which channels are running using the following command:
```
runmqsc display chstatus(*)
```
If you have channels in RETRYING state, there might be an error in the channel definition, or the remote queue manager might not be running. While channels are in this state, messages are likely to build up on transmit queues. If channels to full repositories are in this state, then the definitions of cluster objects (for example queues and queue managers) become out-of-date and inconsistent across the cluster.
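If you capture the channel status output to a file or string, a short script can pick out the channels that need attention. The following Python sketch assumes the output consists of response blocks that each begin with an AMQ message line, followed by KEY(VALUE) attribute pairs; the sample text is illustrative rather than captured from a real system, and the function names are hypothetical. The states to flag are a parameter.

```python
import re

# Matches KEY(VALUE) pairs; VALUE may contain one nested level of
# parentheses, as in CONNAME(127.0.0.1(1415)).
ATTR = re.compile(r"([A-Z]+)\(((?:[^()]|\([^()]*\))*)\)")


def parse_records(text):
    """Split MQSC-style output into one dict per 'AMQ...' response block."""
    records, current = [], None
    for line in text.splitlines():
        if line.startswith("AMQ"):
            current = {}
            records.append(current)
        elif current is not None:
            for key, value in ATTR.findall(line):
                current[key] = value.strip()
    return records


def problem_channels(text, bad_states=("RETRYING", "STOPPED")):
    """Return (channel, status) pairs whose STATUS is one of bad_states."""
    return [
        (rec["CHANNEL"], rec["STATUS"])
        for rec in parse_records(text)
        if "CHANNEL" in rec and rec.get("STATUS") in bad_states
    ]


# Illustrative sample only; assumed layout, not real captured output.
SAMPLE = """\
AMQ8417I: Display Channel Status details.
   CHANNEL(CLUSTER1.QM2)                   CHLTYPE(CLUSSDR)
   CONNAME(127.0.0.1(1415))                CURRENT
   RQMNAME(QM2)                            STATUS(RETRYING)
AMQ8417I: Display Channel Status details.
   CHANNEL(CLUSTER1.QM1)                   CHLTYPE(CLUSRCVR)
   CONNAME(127.0.0.1(1414))                CURRENT
   RQMNAME(QM2)                            STATUS(RUNNING)
"""

for channel, status in problem_channels(SAMPLE):
    print(f"{channel} is {status}")
```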
Check that no channels are in STOPPED state. Channels go into STOPPED state when you stop them manually. Channels that are stopped can be restarted using the following command:
```
runmqsc start channel(xyz)
```
A clustered queue manager auto-defines cluster channels to other queue managers in a cluster, as required. These auto-defined cluster channels start automatically as needed by the queue manager, unless they were previously stopped manually. If an auto-defined cluster channel is stopped manually, the queue manager remembers that it was manually stopped and does not start it automatically in the future. If you need to stop a channel, either remember to restart it again at a convenient time, or else issue the following command:
```
stop channel(xyz) status(inactive)
```
The status(inactive) option allows the queue manager to restart the channel at a later date if it needs to do so.
Check that all queue managers in the cluster are aware of all the full repositories. You can do this using the following command:
```
runmqsc display clusqmgr(*) qmtype
```
Partial repositories might not be aware of all other partial repositories. All full repositories should be aware of all queue managers in the cluster. If cluster queue managers are missing, this might mean that certain channels are not running correctly.
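The awareness check comes down to a set comparison: every queue manager should list every full repository. The following Python sketch assumes you have already noted, for each queue manager, which cluster queue managers it reports as full repositories; the data layout, function name, and queue manager names are hypothetical.

```python
def missing_full_repositories(expected, seen_by):
    """Map each queue manager to the full repositories it does not know about.

    expected: set of queue manager names that should be full repositories
    seen_by:  dict mapping qmgr name -> set of full repositories it reports
    """
    gaps = {}
    for qmgr, known in seen_by.items():
        missing = expected - known
        if missing:
            gaps[qmgr] = sorted(missing)
    return gaps


# Hypothetical cluster: QM3 is a partial repository that only knows about FR1.
expected = {"FR1", "FR2"}
seen_by = {
    "FR1": {"FR1", "FR2"},
    "FR2": {"FR1", "FR2"},
    "QM3": {"FR1"},
}

print(missing_full_repositories(expected, seen_by))
```

A queue manager that appears in the result is a candidate for the channel checks in the rest of this checklist.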
Check that every queue manager (full repositories and partial repositories) in the cluster has a manually defined cluster receiver channel running and is defined in the correct cluster. To see which other queue managers are talking to a cluster receiver channel, use the following command:
```
runmqsc display channel(*) rqmname
```
Check that each manually defined cluster receiver has a conname parameter defined to be ipaddress(port). Without a correct connection name, the other queue manager does not know the connection details to use when connecting back.
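When reviewing many channel definitions, the format of the connection name can be validated before anything else is investigated. The following Python sketch checks only the single `ipaddress(port)` form named in this step; it does not attempt to cover other connection name forms, and the function name and sample addresses are hypothetical.

```python
import re

# host(port): a host or IP address with no spaces or parentheses,
# followed by a numeric port in parentheses.
CONNAME = re.compile(r"^(?P<host>[^\s()]+)\((?P<port>\d+)\)$")


def parse_conname(conname):
    """Return (host, port) for a 'host(port)' connection name, else None."""
    match = CONNAME.match(conname.strip())
    if not match:
        return None
    port = int(match.group("port"))
    if not 1 <= port <= 65535:
        return None
    return match.group("host"), port


# Hypothetical values taken from channel definitions under review.
for value in ["192.0.2.10(1414)", "192.0.2.10", "qm2.example.com(99999)"]:
    result = parse_conname(value)
    print(value, "->", result if result else "not in ipaddress(port) form")
```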
Check that every partial repository has a manually defined cluster sender channel running to a full repository, and defined in the correct cluster.
The cluster sender channel name must match the cluster receiver channel name on the other queue manager.
Check that every full repository has a manually defined cluster sender channel running to every other full repository, and defined in the correct cluster.
The cluster sender channel name must match the cluster receiver channel name on the other queue manager. Each full repository does not keep a record of what other full repositories are in the cluster. It assumes that any queue manager to which it has a manually defined cluster sender channel is a full repository.
Check the dead letter queue.
Messages that the queue manager cannot deliver are sent to the dead letter queue.
Check that, for each partial repository queue manager, you have defined a single cluster-sender channel to one of the full repository queue managers. This channel acts as a "bootstrap" channel through which the partial repository queue manager initially joins the cluster.
Check that the intended full repository queue managers are actual full repositories and are in the correct cluster. You can do this using the following command:
```
runmqsc display qmgr repos reposnl
```
Check that messages are not building up on transmit queues or system queues. You can check transmit queues using the following command:
```
runmqsc display ql(*) curdepth where (usage eq xmitq)
```
You can check system queues using the following command:
```
display ql(system*) curdepth
```
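To spot a backlog quickly in a long queue listing, the depth values can be extracted and filtered. The following Python sketch assumes the captured output contains QUEUE(name) and CURDEPTH(n) attribute pairs, with each CURDEPTH following the QUEUE it belongs to; the sample text is illustrative rather than real captured output, and the function names and the threshold are hypothetical.

```python
import re

# Picks out QUEUE(...) and CURDEPTH(...) attributes in the order they appear.
TOKEN = re.compile(r"\b(QUEUE|CURDEPTH)\(([^()]*)\)")


def queue_depths(text):
    """Return {queue_name: current_depth} from MQSC-style queue output."""
    depths, queue = {}, None
    for key, value in TOKEN.findall(text):
        if key == "QUEUE":
            queue = value.strip()
        elif queue is not None:
            depths[queue] = int(value)
            queue = None
    return depths


def backlogged(text, threshold=0):
    """Return only the queues whose depth is greater than threshold."""
    return {q: d for q, d in queue_depths(text).items() if d > threshold}


# Illustrative sample only; assumed layout, not real captured output.
SAMPLE = """\
AMQ8409I: Display Queue details.
   QUEUE(SYSTEM.CLUSTER.TRANSMIT.QUEUE)    TYPE(QLOCAL)
   CURDEPTH(120)
AMQ8409I: Display Queue details.
   QUEUE(SYSTEM.CLUSTER.COMMAND.QUEUE)     TYPE(QLOCAL)
   CURDEPTH(0)
"""

for name, depth in backlogged(SAMPLE).items():
    print(f"{name}: {depth} messages waiting")
```

A depth that stays above zero across repeated checks is more significant than a single non-zero reading, so run the comparison more than once before drawing conclusions.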

Application balancing troubleshooting
A list of symptoms and solutions associated with application balancing, using the DISPLAY APSTATUS command.
Application issues seen when running REFRESH CLUSTER
Issuing REFRESH CLUSTER is disruptive to the cluster. It might make cluster objects invisible for a short time until the REFRESH CLUSTER processing completes. This can affect running applications. These notes describe some of the application issues you might see.
A cluster-sender channel is continually trying to start
Check the queue manager and listener are running, and the cluster-sender and cluster-receiver channel definitions are correct.
DISPLAY CLUSQMGR shows CLUSQMGR names starting SYSTEM.TEMP.
The queue manager has not received any information from the full repository queue manager that the manually defined CLUSSDR channel points to. Check that the cluster channels are defined correctly.
Return code= 2035 MQRC_NOT_AUTHORIZED
The RC2035 reason code is displayed for various reasons including an error on opening a queue or a channel, an error received when you attempt to use a user ID that has administrator authority, an error when using an IBM MQ JMS application, and opening a queue on a cluster. MQS_REPORT_NOAUTH and MQSAUTHERRORS can be used to further diagnose RC2035.
Return code= 2085 MQRC_UNKNOWN_OBJECT_NAME when trying to open a queue in the cluster
Return code= 2189 MQRC_CLUSTER_RESOLUTION_ERROR when trying to open a queue in the cluster
Make sure that the CLUSSDR channels to the full repositories are not continually trying to start.
Return code=2082 MQRC_UNKNOWN_ALIAS_BASE_Q opening a queue in the cluster
Applications get rc=2082 MQRC_UNKNOWN_ALIAS_BASE_Q when trying to open a queue in the cluster.
Messages are not arriving on the destination queues
Make sure that the corresponding cluster transmission queue is empty and also that the channel to the destination queue manager is running.
Messages put to a cluster alias queue go to SYSTEM.DEAD.LETTER.QUEUE
A cluster alias queue resolves to a local queue that does not exist.
A queue manager has out of date information about queues and channels in the cluster
No changes in the cluster are being reflected in the local queue manager
The repository manager process is not processing repository commands, possibly because of a problem with receiving or processing messages in the command queue.
DISPLAY CLUSQMGR displays a queue manager twice
Use the RESET CLUSTER command to remove all traces of an old instance of a queue manager.
A queue manager does not rejoin the cluster
After issuing a RESET or REFRESH cluster command the channel from the queue manager to the cluster might be stopped. Check the cluster channel status and restart the channel.
Workload balancing set on a cluster-sender channel is not working
Any workload balancing you specify on a cluster-sender channel is likely to be ignored. Instead, specify the cluster workload channel attributes on the cluster-receiver channel at the target queue manager.
Out of date information in a restored cluster
After restoring a queue manager, its cluster information is out of date. Refresh the cluster information with the REFRESH CLUSTER command.
Cluster queue manager force removed from a full repository by mistake
Restore the queue manager to the full repository by issuing the command REFRESH CLUSTER on the queue manager that was removed from the repository.
Possible repository messages deleted
Messages destined for a queue manager were removed from the SYSTEM.CLUSTER.TRANSMIT.QUEUE in other queue managers. Restore the information by issuing the REFRESH CLUSTER command on the affected queue manager.
Two full repositories moved at the same time
If you move both full repositories to new network addresses at the same time, the cluster is not updated with the new addresses automatically. Follow the procedure to transfer the new network addresses. Move the repositories one at a time to avoid the problem.
Unknown state of a cluster
Restore the cluster information in all the full repositories to a known state by rebuilding the full repositories from all the partial repositories in the cluster.
What happens when a cluster queue manager fails
When a cluster queue manager fails, some undelivered messages are sent to other queue managers in the cluster. Messages that are in-flight wait until the queue manager is restarted. Use a high-availability mechanism to restart a queue manager automatically.
What happens when a repository fails
How you know that a repository has failed, and what to do to fix it.
What happens if a cluster queue is disabled for MQPUT
All instances of a cluster queue that is being used for workload balancing might be disabled for MQPUT. Applications putting a message to the queue receive either an MQRC_CLUSTER_PUT_INHIBITED or an MQRC_PUT_INHIBITED return code. You might want to modify this behavior.
