backup and recovery, after Coupling Facility failure, recovery, restart, Coupling Facility structures, backup, persistent messages, queue-sharing groups" />
Coupling Facility failure
In the unlikely event of a Coupling Facility failure, any nonpersistent messages stored in the affected CF structures are lost. You can recover persistent messages using the RECOVER CFSTRUCT command.
To ensure that you can recover a CF structure in a reasonable time, take frequent backups using the BACKUP CFSTRUCT command. You can either 'round-robin' the backups across all the queue managers in the queue-sharing group, or dedicate one queue manager to take all the backups.
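The two backup strategies can be sketched as one rotation, where a dedicated queue manager is simply a rotation of length one. This is a minimal illustration, not a scheduler supplied with the product: the function name, the structure names, and the queue manager names are hypothetical, and it assumes that the CMDSCOPE parameter is used to direct each BACKUP CFSTRUCT command to a particular queue manager.

```python
def backup_schedule(structures, queue_managers, runs):
    """Return the BACKUP CFSTRUCT commands for a number of backup runs.

    Each run backs up every structure, and successive runs rotate
    through queue_managers. Pass a single queue manager to model a
    dedicated backup queue manager.
    """
    if not queue_managers:
        raise ValueError("at least one queue manager is required")
    schedule = []
    for run in range(runs):
        # Pick the queue manager whose turn it is for this run.
        qmgr = queue_managers[run % len(queue_managers)]
        for structure in structures:
            schedule.append(f"BACKUP CFSTRUCT({structure}) CMDSCOPE({qmgr})")
    return schedule
```

For example, `backup_schedule(["APP1"], ["QM01", "QM02"], 3)` alternates the backup of APP1 between QM01 and QM02, whereas passing `["QM01"]` keeps every backup on QM01.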
Each backup is written to the active log data set of the queue manager taking the backup. The shared queue DB2 repository records the name of the CF structure being backed up, the name of the queue manager doing the backup, the RBA range for this backup on that queue manager's log, and the backup time.
You recover a CF structure by issuing a RECOVER CFSTRUCT command to the queue manager that you want to do the recovery. You can recover a single CF structure, or you can recover several CF structures simultaneously. The command locates the backup through the DB2 repository information and forward recovers it to the point of failure. It does this by applying log records from any queue manager in the queue-sharing group that has performed an MQPUT or MQGET, between the start of the backup and the time of failure, on any shared queue that maps to the CF structure. Merging these logs might require reading a considerable amount of log data, so you are strongly advised to take frequent (say, hourly) backups, especially if the backup contains large messages.
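The log merge described above can be pictured with a small conceptual model. It is not how the queue manager is implemented: the `LogRecord` type and its fields are hypothetical, and plain numeric timestamps stand in for whatever ordering the real logs use. The sketch only shows why every queue manager that touched the structure since the backup contributes records, and why a longer interval since the last backup means more log data to read.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    timestamp: float     # simplified ordering key (assumption)
    queue_manager: str   # queue manager that wrote the record
    structure: str       # CF structure the shared queue maps to
    operation: str       # "MQPUT" or "MQGET"


def records_to_replay(logs, structure, backup_start, failure_time):
    """Merge per-queue-manager logs into one time-ordered replay list.

    logs maps a queue manager name to its records, each list already
    sorted by timestamp. Only records for the given structure, logged
    between the start of the backup and the failure, are kept.
    """
    relevant = [
        [
            record
            for record in records
            if record.structure == structure
            and backup_start <= record.timestamp <= failure_time
        ]
        for records in logs.values()
    ]
    # heapq.merge interleaves the already-sorted lists lazily.
    return list(heapq.merge(*relevant, key=lambda record: record.timestamp))
```

In this model, halving the interval between backups roughly halves the window `[backup_start, failure_time]` that must be scanned on every contributing log.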
If a recoverable application structure has failed, any further application activity is prevented until the structure has been recovered. If the administration structure has also failed, all the queue managers in the queue-sharing group must be started before you can issue the RECOVER CFSTRUCT command.
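The precondition above lends itself to a simple pre-flight check before the recovery command is issued. The following is a sketch under stated assumptions: the function and its arguments are hypothetical, the caller is assumed to know which queue managers are started, and the comma-separated list of structure names should be verified against the MQSC command reference for your release.

```python
def recover_command(structures, admin_failed, queue_managers_started):
    """Build a RECOVER CFSTRUCT command, or raise if it cannot be issued yet.

    queue_managers_started maps each queue manager in the queue-sharing
    group to True if it has been started.
    """
    if not structures:
        raise ValueError("name at least one CF structure to recover")
    if admin_failed:
        # All queue managers must be started when the administration
        # structure has failed as well.
        stopped = sorted(
            name for name, started in queue_managers_started.items() if not started
        )
        if stopped:
            raise RuntimeError(
                "start these queue managers first: " + ", ".join(stopped)
            )
    return "RECOVER CFSTRUCT(" + ",".join(structures) + ")"
```

Calling `recover_command(["APP1", "APP2"], False, {})` returns a single command that recovers both structures together.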
If a CF structure fails, the action taken by connected queue managers depends on the following:
- The structure type (application or administration)
- The queue manager level (V5.3 or V6.0)
- The CFLEVEL of the CFSTRUCT object (2, 3, or 4)
- The type of failure reported by the XES component of z/OS to WebSphere MQ.
The following scenarios describe what happens when an administration structure fails:
- If the administration structure fails and the queue manager is running at V6.0, the structure is reallocated and rebuilt automatically without the queue manager terminating. Any serialized applications that have already connected to the queue manager can continue processing. Any serialized application that attempts to connect with MQCNO_SERIALIZE_CONN_TAG_QSG or MQCNO_RESTRICT_CONN_TAG_QSG receives the MQRC_CONN_TAG_NOT_USABLE return code until all the queue managers in the queue-sharing group have rebuilt their administration structure entries. Certain actions on shared queues are suspended until the queue manager has reconnected to the administration structure and finished rebuilding its entries in the structure. The suspended actions include the following:
- Opening and closing of shared queues.
- Committing or backing out units of recovery.
- Serialized applications connecting to or disconnecting from the queue manager.
When the administration structure entries for the queue manager have been rebuilt, the suspended actions continue.
You cannot back up or recover an application structure until all the queue managers in the queue-sharing group have rebuilt their administration structure entries.
- If an administration structure fails and the queue manager is running at V5.3 or V5.3.1, or at a lower level, the queue manager terminates.
The following scenarios describe what happens when an application structure fails:
- If an application structure fails and the CFLEVEL of the CF structure is 1 or 2, the queue manager terminates.
- If an application structure fails and the CFLEVEL is 3 or 4, and the error reported by XES is a connection failure, the queue manager terminates. This allows other queue managers in the queue-sharing group that might still have connectivity to perform peer-level recovery, which makes messages more available.
If a CF structure fails, V5.3 and V6.0 queue managers connected to a CFLEVEL(3) or CFLEVEL(4) CF structure continue to run, and applications that do not use the queues in the failed structure can continue normal processing. However, applications that attempt operations on queues in the failed structure receive errors until the RECOVER CFSTRUCT command has successfully rebuilt the failed structure, at which point new requests to open queues in the structure are allowed.
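The scenarios above can be summarized as a small decision function. This is one reading of the rules in this topic, not product code: the function name, argument names, and returned strings are hypothetical, and any queue manager level other than "V6.0" is treated as V5.3 or lower.

```python
def action_on_failure(structure_type, qmgr_level, cflevel, connection_failure):
    """Return what a connected queue manager does when a CF structure fails.

    structure_type     : "administration" or "application"
    qmgr_level         : for example "V5.3" or "V6.0"
    cflevel            : CFLEVEL of the CFSTRUCT object (application only)
    connection_failure : True if XES reported a connection failure
    """
    if structure_type == "administration":
        if qmgr_level == "V6.0":
            # The structure is reallocated and rebuilt automatically.
            return "rebuild"
        return "terminate"
    if structure_type != "application":
        raise ValueError("unknown structure type: " + structure_type)
    if cflevel in (1, 2):
        return "terminate"
    if connection_failure:
        # Lets peers that still have connectivity perform recovery.
        return "terminate"
    # CFLEVEL 3 or 4: keep running; operations on queues in the failed
    # structure receive errors until RECOVER CFSTRUCT has completed.
    return "continue"
```

Writing the rules this way makes the one non-terminating application case explicit: a CFLEVEL(3) or CFLEVEL(4) structure whose failure is not a connection failure.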