Alternative site recovery on z/OS

We can recover a single queue manager or a queue sharing group, or consider disk mirroring.

See the following sections for more details:


Recovering a single queue manager at an alternative site

If a total loss of an IBM MQ computing center occurs, we can recover on another queue manager or queue sharing group at a recovery site. (See Recovering a queue sharing group at the alternative site for the alternative site recovery procedure for a queue sharing group.)

To recover on another queue manager at a recovery site, we must regularly back up the page sets and the logs. As with all data recovery operations, the objectives of disaster recovery are to lose as little data, workload processing (updates), and time, as possible.

At the recovery site:

  • The recovery queue managers must have the same names as the lost queue managers.
  • The system parameter module (for example, CSQZPARM) used on each recovery queue manager must contain the same parameters as the corresponding lost queue manager.

When these conditions are met, reestablish all your queue managers as described in the following procedure, which performs disaster recovery at the recovery site for a single queue manager. It assumes that all that is available are:

  • Copies of the archive logs and BSDSs created by normal running at the primary site (the active logs will have been lost along with the queue manager at the primary site).
  • Copies of the page sets from the queue manager at the primary site that are the same age or older than the most recent archive log copies available.

We can use dual logging for the active and archive logs, in which case we need to apply the BSDS updates to both copies. The procedure is as follows:

  1. Define new page set data sets and load them with the data in the copies of the page sets from the primary site.
  2. Define new active log data sets.
  3. Define a new BSDS data set and use Access Method Services REPRO to copy the most recent archived BSDS into it.
  4. Use the print log map utility CSQJU004 to print information from this most recent BSDS. At the time this BSDS was archived, the most recent archived log you have would have just been truncated as an active log, and does not appear as an archived log. Record the STARTRBA and ENDRBA of this log.
  5. Use the change log inventory utility, CSQJU003, to register this latest archive log data set in the BSDS that we have just restored, using the STARTRBA and ENDRBA recorded in Step 4.
  6. Use the DELETE option of CSQJU003 to remove all active log information from the BSDS.
  7. Use the NEWLOG option of CSQJU003 to add active logs to the BSDS. Do not specify STARTRBA or ENDRBA.
  8. Use CSQJU003 to add a restart control record to the BSDS. Specify CRESTART CREATE,ENDRBA=highrba, where highrba is the high RBA of the most recent archive log available (found in Step 4), plus 1.

    The BSDS now describes all active logs as being empty, all the archived logs you have available, and no checkpoints beyond the end of our logs.

  9. Restart the queue manager with the START QMGR command. During initialization, an operator reply message such as the following is issued:
    CSQJ245D +CSQ1 RESTART CONTROL INDICATES TRUNCATION AT RBA highrba.
    REPLY Y TO CONTINUE, N TO CANCEL
    

    Reply Y to start the queue manager. The queue manager starts, and recovers data up to the ENDRBA specified in the CRESTART statement.

See Use the IBM MQ utilities for information about using CSQJU003 and CSQJU004.

The following example shows sample input statements for CSQJU003 for steps 6, 7, and 8:
* Step 6
DELETE DSNAME=MQM2.LOGCOPY1.DS01
DELETE DSNAME=MQM2.LOGCOPY1.DS02
DELETE DSNAME=MQM2.LOGCOPY1.DS03
DELETE DSNAME=MQM2.LOGCOPY1.DS04
DELETE DSNAME=MQM2.LOGCOPY2.DS01
DELETE DSNAME=MQM2.LOGCOPY2.DS02
DELETE DSNAME=MQM2.LOGCOPY2.DS03
DELETE DSNAME=MQM2.LOGCOPY2.DS04

* Step 7
NEWLOG DSNAME=MQM2.LOGCOPY1.DS01,COPY1
NEWLOG DSNAME=MQM2.LOGCOPY1.DS02,COPY1
NEWLOG DSNAME=MQM2.LOGCOPY1.DS03,COPY1
NEWLOG DSNAME=MQM2.LOGCOPY1.DS04,COPY1
NEWLOG DSNAME=MQM2.LOGCOPY2.DS01,COPY2
NEWLOG DSNAME=MQM2.LOGCOPY2.DS02,COPY2
NEWLOG DSNAME=MQM2.LOGCOPY2.DS03,COPY2
NEWLOG DSNAME=MQM2.LOGCOPY2.DS04,COPY2

* Step 8
CRESTART CREATE,ENDRBA=063000
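
The ENDRBA value in Step 8 is the high RBA of the most recent archive log plus 1, computed in hexadecimal. The following Python sketch shows that arithmetic only. The function name `crestart_endrba` is hypothetical, and the high RBA `062FFF` is an assumed value chosen so that the result matches the sample statement above. Take the real value from the CSQJU004 output recorded in Step 4.

```python
def crestart_endrba(high_rba_hex: str) -> str:
    """Build the Step 8 CRESTART statement from the high RBA (a hex string)."""
    digits = high_rba_hex.replace(" ", "")
    # RBAs are hexadecimal, so add 1 in base 16, not base 10.
    end_rba = int(digits, 16) + 1
    # Pad to the input width so that leading zeros are preserved.
    return f"CRESTART CREATE,ENDRBA={end_rba:0{len(digits)}X}"


# Assumed high RBA for illustration only.
print(crestart_endrba("062FFF"))
```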

The things we need to consider for restarting the channel initiator at the recovery site are like those faced when using ARM to restart the channel initiator on a different z/OS image. See Use ARM in an IBM MQ network for more information. Your recovery strategy should also cover recovery of the IBM MQ product libraries and the application programming environments that use IBM MQ (CICS, for example).

Other functions of the change log inventory utility (CSQJU003) can also be used in disaster recovery scenarios. The HIGHRBA function allows the update of the highest RBA written and highest RBA offloaded values within the bootstrap data set. The CHECKPT function allows the addition of new checkpoint queue records or the deletion of existing checkpoint queue records in the BSDS.

Attention: These functions might affect the integrity of the IBM MQ data. Only use them in disaster recovery scenarios under the guidance of IBM service personnel.

    Fast copy techniques

    If copies of all the page sets and logs are made while the queue manager is frozen, the copies will be a consistent set that can be used to restart the queue manager at an alternative site. They typically enable a much faster restart of the queue manager, as there is little media recovery to be performed.

    Use the SUSPEND QMGR LOG command to freeze the queue manager. This command flushes buffer pools to the page sets, takes a checkpoint, and stops any further log write activity. Once log write activity has been suspended, the queue manager is effectively frozen until we issue a RESUME QMGR LOG command. While the queue manager is frozen, the page sets and logs can be copied.

    By using copying tools such as FLASHCOPY or SNAPSHOT to rapidly copy the page sets and logs, the time during which the queue manager is frozen can be reduced to a minimum.
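
When the freeze-and-copy sequence is automated, the important property is that RESUME QMGR LOG is always issued, even if the copy step fails, so that the queue manager is not left frozen. The following Python sketch illustrates that ordering only. The function `copy_while_frozen` and both callables passed to it are hypothetical placeholders: how commands reach the queue manager and how the copy tool is driven are site-specific and are not shown here.

```python
from typing import Callable, List


def copy_while_frozen(issue_command: Callable[[str], None],
                      copy_data_sets: Callable[[], None]) -> None:
    """Freeze logging, copy page sets and logs, then always resume logging."""
    issue_command("SUSPEND QMGR LOG")
    try:
        copy_data_sets()
    finally:
        # Runs whether or not the copy succeeded.
        issue_command("RESUME QMGR LOG")


# Illustration with stand-ins that only record what would happen.
issued: List[str] = []
copy_while_frozen(issued.append,
                  lambda: issued.append("<copy page sets and logs>"))
print(issued)
```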

    Within a queue sharing group, however, the SUSPEND QMGR LOG command might not be such a good solution. To be effective, the copies of the logs must all contain the same point in time for recovery, which means that the SUSPEND QMGR LOG command must be issued on all queue managers within the queue sharing group simultaneously, and therefore the entire queue sharing group will be frozen for some time.


Recovering a queue sharing group

In the event of a prime site disaster, we can restart a queue sharing group at a remote site using backup data sets from the prime site. To recover a queue sharing group we need to coordinate the recovery across all the queue managers in the queue sharing group, and coordinate with other resources, primarily Db2. This section describes these tasks in detail.

    CF structure media recovery

    Media recovery of a CF structure used to hold persistent messages on a shared queue relies on having a backup of the media that can be forward recovered by applying logged updates. Take backups of the CF structures periodically using the MQSC BACKUP CFSTRUCT command. All updates to shared queues (MQGETs and MQPUTs) are written on the log of the queue manager where the update is performed. To perform media recovery of a CF structure we must apply logged updates to that backup from the logs of all the queue managers that have used that CF structure. When we use the MQSC RECOVER CFSTRUCT command, IBM MQ automatically merges the logs from the relevant queue managers, and applies the updates to the most recent backup.

    The CF structure backup is written to the log of the queue manager that processed the BACKUP CFSTRUCT command, so there are no additional data sets to be collected and transported to the alternative site.

    Backing up the queue sharing group at the prime site

    At the prime site we need to establish a consistent set of backups on a regular basis, which can be used in the event of a disaster to rebuild the queue sharing group at an alternative site. For a single queue manager, recovery can be to an arbitrary point in time, typically the end of the logs available at the remote site. However, where persistent messages have been stored on a shared queue, the logs of all the queue managers in the queue sharing group must be merged to recover shared queues, as any queue manager in the queue sharing group might have performed updates (MQPUTs or MQGETs) on the queue.

    For recovery of a queue sharing group, we need to establish a point in time that is within the log range of the log data of all queue managers. However, as we can only forward recover media from the log, this point in time must be after the BACKUP CFSTRUCT command has been issued and after any page set backups have been performed. (Typically, the point in time for recovery might correspond to the end of a business day or week.)

    The following diagram shows time lines for two queue managers in a queue sharing group. For each queue manager, fuzzy backups of page sets are taken (see Method 2: Fuzzy backup). On queue manager A, a BACKUP CFSTRUCT command is issued. Subsequently, an ARCHIVE LOG command is issued on each queue manager to truncate the active log, and copy it to media offline from the queue manager, which can be transported to the alternative site. End of log identifies the time at which the ARCHIVE LOG command was issued, and therefore marks the extent of log data typically available at the alternative site. The point in time for recovery must lie between the end of any page set or CF structure backups, and the earliest end of log available at the alternative site.

    Figure 1. Point in time for recovery for 2 queue managers in a queue sharing group
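
The constraint on the point in time for recovery can be expressed as a simple window: the lower bound is the latest end of any page set or CF structure backup, and the upper bound is the earliest end of log across the queue managers. The following Python sketch is illustrative only. The function `recovery_window` is hypothetical, the integers are abstract time points rather than real LRSN values, and treating the bounds as inclusive is an assumption.

```python
from typing import Iterable, Tuple


def recovery_window(backup_end_points: Iterable[int],
                    end_of_log_points: Iterable[int]) -> Tuple[int, int]:
    """Return (earliest, latest) usable points in time for recovery."""
    lower = max(backup_end_points)   # every backup must have finished
    upper = min(end_of_log_points)   # every queue manager must have log data
    if lower > upper:
        raise ValueError("no point in time is covered by all backups and logs")
    return lower, upper


# Abstract time points: two page set backups and one CF structure backup,
# followed by the end of log on queue managers A and B.
print(recovery_window([100, 120, 130], [200, 180]))
```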

    IBM MQ records information associated with the CF structure backups in a table in Db2. Depending on your requirements, you might want to coordinate the point in time for recovery of IBM MQ with that for Db2, or it might be sufficient to take a copy of the IBM MQ CSQ.ADMIN_B_STRBACKUP table after the BACKUP CFSTRUCT commands have finished.

    To prepare for a recovery:
    1. Create page set backups for each queue manager in the queue sharing group.
    2. Issue a BACKUP CFSTRUCT command for each CF structure with the RECOVER(YES) attribute. We can issue these commands from a single queue manager, or from different queue managers within the queue sharing group to balance the workload.
    3. Once all the backups have completed, issue an ARCHIVE LOG command to switch the active log and create copies of the logs and BSDSs of each queue manager in the queue sharing group.
    4. Transport the page set backups, the archived logs, the archived BSDS of all the queue managers in the queue sharing group, and your chosen Db2 backup information, off-site.
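
The ordering of the MQSC commands in Steps 2 and 3 can be sketched as follows. This Python fragment only builds the list of commands and the queue manager that would issue each one. It does not cover the page set backups in Step 1 or the transport in Step 4. The function `backup_plan`, the queue manager names QM01 and QM02, and the structure names APP1 and APP2 are all hypothetical, and issuing every BACKUP CFSTRUCT command from one queue manager is just one of the two options that Step 2 allows.

```python
from typing import Dict, List, Tuple


def backup_plan(queue_managers: List[str],
                cf_structures: Dict[str, bool],
                backup_qmgr: str) -> List[Tuple[str, str]]:
    """Return (queue manager, MQSC command) pairs in the order to issue them."""
    # Step 2: back up only the structures defined with RECOVER(YES).
    plan = [(backup_qmgr, f"BACKUP CFSTRUCT({name})")
            for name, recoverable in cf_structures.items() if recoverable]
    # Step 3: after the backups complete, archive the log on every queue manager.
    plan += [(qmgr, "ARCHIVE LOG") for qmgr in queue_managers]
    return plan


plan = backup_plan(["QM01", "QM02"], {"APP1": True, "APP2": False},
                   backup_qmgr="QM01")
for qmgr, command in plan:
    print(qmgr, command)
```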

    Recovering a queue sharing group at the alternative site

    Before we can recover the queue sharing group, we need to prepare the environment:

    1. If the coupling facility contains old information from practice startups performed when the queue sharing group was installed, clean it out first (if the coupling facility contains no old information, we can omit this step):
      1. Enter the following z/OS command to display the CF structures for this queue sharing group:
        D XCF,STRUCTURE,STRNAME=qsgname
        
      2. For all structures that start with the queue sharing group name, use the z/OS command SETXCF FORCE CONNECTION to force the connection off those structures:
        SETXCF FORCE,CONNECTION,STRNAME=strname,CONNAME=ALL
        
      3. Delete all the CF structures using the following command for each structure:
        SETXCF FORCE,STRUCTURE,STRNAME=strname
        

    2. Restore Db2 systems and data-sharing groups.
    3. Recover the CSQ.ADMIN_B_STRBACKUP table so that it contains information about the most recent structure backups taken at the prime site. Note: It is important that the STRBACKUP table contains the most recent structure backup information. Older structure backup information might require data sets that we have discarded as a result of the information given by a recent DISPLAY USAGE TYPE(DATASET) command, which would mean that your recovered CF structure would not contain accurate information.
    4. Run the ADD QMGR command of the CSQ5PQSG utility for every queue manager in the queue sharing group. This will restore the XCF group entry for each queue manager.

    To recover the queue managers in the queue sharing group:

    1. Define new page set data sets and load them with the data in the copies of the page sets from the primary site.
    2. Define new active log data sets.
    3. Define a new BSDS data set and use Access Method Services REPRO to copy the most recent archived BSDS into it.
    4. Use the print log map utility CSQJU004 to print information from this most recent BSDS. At the time this BSDS was archived, the most recent archived log you have would have just been truncated as an active log, and does not appear as an archived log. Record the STARTRBA, STARTLRSN, ENDRBA, and ENDLRSN values of this log.
    5. Use the change log inventory utility, CSQJU003, to register this latest archive log data set in the BSDS that we have just restored, using the values recorded in Step 4.
    6. Use the DELETE option of CSQJU003 to remove all active log information from the BSDS.
    7. Use the NEWLOG option of CSQJU003 to add active logs to the BSDS. Do not specify STARTRBA or ENDRBA.
    8. Calculate the recoverylrsn for the queue sharing group. The recoverylrsn is the lowest of the ENDLRSNs across all queue managers in the queue sharing group (as recorded in Step 4), minus 1. For example, if there are two queue managers in the queue sharing group, and the ENDLRSN for one of them is B713 3C72 22C5, and for the other is B713 3D45 2123, the recoverylrsn is B713 3C72 22C4.
    9. Use CSQJU003 to add a restart control record to the BSDS. Specify:
      CRESTART CREATE,ENDLRSN=recoverylrsn

      where recoverylrsn is the value you calculated in Step 8.

      The BSDS now describes all active logs as being empty, all the archived logs you have available, and no checkpoints beyond the end of our logs.

      We must add the CRESTART record to the BSDS for each queue manager within the queue sharing group.

    10. Restart each queue manager in the queue sharing group with the START QMGR command. During initialization, an operator reply message such as the following is issued:
      CSQJ245D +CSQ1 RESTART CONTROL INDICATES TRUNCATION AT RBA highrba.
      REPLY Y TO CONTINUE, N TO CANCEL
      

      Reply Y to start the queue manager. The queue manager starts, and recovers data up to the ENDLRSN specified in the CRESTART statement.

      For IBM WebSphere MQ Version 7.0.1 and later, the first queue manager started can rebuild the admin structure partitions for other members of the queue sharing group as well as its own, and it is no longer necessary to restart each queue manager in the queue sharing group at this stage.

    11. When the admin structure data for all queue managers has been rebuilt, issue a RECOVER CFSTRUCT command for each CF application structure.

      If we issue the RECOVER CFSTRUCT command for all structures on a single queue manager, the log merge process is only performed once, so is quicker than issuing the command on a different queue manager for each CF structure, where each queue manager has to perform the log merge step.
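
The recoverylrsn calculation in Step 8 is hexadecimal arithmetic: take the lowest ENDLRSN across the queue managers and subtract 1. The following Python sketch reproduces the worked example from Step 8. The function name `recovery_lrsn` is hypothetical, and writing the result without spaces in the CRESTART statement is an assumption (the spaces in the text are taken to be for readability only).

```python
from typing import Iterable


def recovery_lrsn(end_lrsns: Iterable[str]) -> str:
    """Return the lowest ENDLRSN minus 1 as an uppercase hex string."""
    values = [lrsn.replace(" ", "") for lrsn in end_lrsns]
    width = max(len(value) for value in values)
    lowest = min(int(value, 16) for value in values)
    # Pad to the input width so that leading zeros are preserved.
    return f"{lowest - 1:0{width}X}"


# ENDLRSN values from the two-queue-manager example in Step 8.
lrsn = recovery_lrsn(["B713 3C72 22C5", "B713 3D45 2123"])
print(f"CRESTART CREATE,ENDLRSN={lrsn}")
```

The same statement is then added to the BSDS of every queue manager in the queue sharing group, as Step 9 requires.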

    When conditional restart processing is used in a queue sharing group, queue managers at IBM WebSphere MQ Version 7.0.1 and later that perform a peer admin rebuild check that their peers' BSDSs contain the same CRESTART LRSN as their own. This is to ensure the integrity of the rebuilt admin structure. It is therefore important to restart the other peers in the queue sharing group, so that they can process their own CRESTART information, before the next unconditional restart of any member of the group.


Use disk mirroring

Many installations now use disk mirroring technologies such as IBM Metro Mirror (formerly PPRC) to make synchronous copies of data sets at an alternative site. In such situations, many of the steps detailed become unnecessary as the IBM MQ page sets and logs at the alternative site are effectively identical to those at the prime site. Where such technologies are used, the steps to restart a queue sharing group at an alternative site may be summarized as:

  • Clear IBM MQ CF structures at the alternative site. (These often contain residual information from any previous disaster recovery exercise.)
  • Restore Db2 systems and all tables in the database used by the IBM MQ queue sharing group.
  • Restart queue managers. Before IBM WebSphere MQ Version 7.0.1, it is necessary to restart each queue manager defined in the queue sharing group, because each queue manager recovers its own partition of the admin structure during queue manager restart. After each queue manager has been restarted, those not on their home LPAR can be shut down again. For IBM WebSphere MQ Version 7.0.1 and later, the first queue manager started rebuilds the admin structure partitions for other members of the queue sharing group as well as its own, and it is no longer necessary to restart each queue manager in the queue sharing group.
  • After the admin structure has been rebuilt, recover the application structures.

IBM MQ Version 9.1.2, and later, supports use of zHyperWrite when writing to active logs mirrored using Metro Mirror. zHyperWrite can help reduce the performance impact of using Metro Mirror; see Use Metro Mirror with IBM MQ for more information.
