Recovering damaged objects

There are ways in which an IBM MQ object can become unusable, for example because of inadvertent damage. You must then recover either your complete system or some part of it. The action required depends on when the damage is detected, whether the log method selected supports media recovery, and which objects are damaged.

Media recovery

From IBM MQ Version 9.0.2, on a linear logging queue manager, media images can be recorded only for objects that are defined as media recoverable. Whether an object is media recoverable is controlled by the IMGRCOVO and IMGRCOVQ options.

Similarly, on a linear logging queue manager, you can recover from their media images only those objects that are defined as media recoverable. If an object that is not defined as media recoverable is damaged, the options for that object are the same as those for a circular logging queue manager.

Media recovery re-creates objects from information recorded in a linear log. For example, if an object file is inadvertently deleted, or becomes unusable for some other reason, media recovery can re-create it. The information in the log required for media recovery of an object is called a media image.

A media image is a sequence of log records containing an image of an object from which the object itself can be re-created.

The first log record required to re-create an object is known as its media recovery record; it is the start of the latest media image for the object. The media recovery record of each object is one of the pieces of information recorded during a checkpoint.

When an object is re-created from its media image, it is also necessary to replay any log records describing updates performed on the object since the last image was taken.

Consider, for example, a local queue that has an image of the queue object taken before a persistent message is put onto the queue. In order to re-create the latest image of the object, it is necessary to replay the log entries recording the putting of the message to the queue, in addition to replaying the image itself.
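
The relationship between an image and the updates that follow it can be illustrated with a small model. The following Python sketch is not how the queue manager is implemented; the LogRecord type, the record kinds "image" and "put", and the queue names are all hypothetical and exist only to show the idea of starting at the latest image and replaying later records in order.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    """One entry in a simplified, hypothetical linear log."""
    kind: str                      # "image" or "put" (illustrative only)
    queue: str                     # name of the object the record refers to
    payload: list = field(default_factory=list)


def recover_queue(log, queue):
    """Rebuild a queue from its latest image plus all later updates."""
    # Locate the start of the latest image for this queue; this plays
    # the role of the media recovery record.
    start = None
    for index, record in enumerate(log):
        if record.queue == queue and record.kind == "image":
            start = index
    if start is None:
        raise LookupError(f"no media image recorded for {queue}")

    # Restore the image, then replay every later update in log order.
    messages = list(log[start].payload)
    for record in log[start + 1:]:
        if record.queue == queue and record.kind == "put":
            messages.extend(record.payload)
    return messages


log = [
    LogRecord("image", "ORDERS", []),        # image taken while the queue is empty
    LogRecord("put", "ORDERS", ["msg-1"]),   # persistent put after the image
    LogRecord("put", "OTHER", ["ignored"]),  # update to a different object
]
print(recover_queue(log, "ORDERS"))  # ['msg-1']
```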

When an object is created, the log records written contain enough information to completely re-create the object. These records make up the first media image of the object. Then, at each shutdown, the queue manager records media images automatically as follows:

Images of all process objects and queues that are not local
Images of empty local queues

Media images can also be recorded manually using the rcdmqimg command, described in rcdmqimg. This command writes a media image of the IBM MQ object.
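
As a rough sketch of how the command might be assembled from a script, the following Python helper builds an rcdmqimg invocation from the -m (queue manager name) and -t (object type) flags followed by the object name. The queue manager name QM1 and the queue name ORDERS.IN are hypothetical, and the helper only prints the command line; it does not run it.

```python
import shlex


def rcdmqimg_command(qmgr, object_type, name):
    """Return the argument list for recording a media image of one object."""
    return ["rcdmqimg", "-m", qmgr, "-t", object_type, name]


# Hypothetical queue manager and queue names, for illustration only.
cmd = rcdmqimg_command("QM1", "queue", "ORDERS.IN")
print(shlex.join(cmd))  # rcdmqimg -m QM1 -t queue ORDERS.IN
```

Keeping the command as an argument list means it could be handed to subprocess.run without shell quoting concerns, should you choose to execute it on a host where IBM MQ is installed.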

The queue manager records media images automatically if IMGSCHED(AUTO) is set. For information on IMGINTVL and IMGLOGLN, see ALTER QMGR.
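
For illustration, the following Python sketch composes an MQSC command that turns on automatic image recording. It assumes that IMGINTVL takes a target interval in minutes and IMGLOGLN a target amount of log in megabytes; the values 60 and 100 are arbitrary examples, so check ALTER QMGR for the exact semantics and permitted ranges before using anything like this.

```python
def image_schedule_mqsc(interval_minutes, log_megabytes):
    """Compose an ALTER QMGR command enabling automatic media images."""
    return (
        "ALTER QMGR IMGSCHED(AUTO) "
        f"IMGINTVL({interval_minutes}) IMGLOGLN({log_megabytes})"
    )


# Example values only; choose settings that suit your workload.
print(image_schedule_mqsc(60, 100))
# ALTER QMGR IMGSCHED(AUTO) IMGINTVL(60) IMGLOGLN(100)
```

The resulting text is intended to be entered in runmqsc for the queue manager concerned.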

When a media image has been written, only the logs that hold the media image, and all the logs created after this time, are required to re-create damaged objects. The benefit of creating media images depends on such factors as the amount of free storage available, and the speed at which log files are created.

Recovering from media images

IBM MQ automatically recovers some objects from their media image if it finds that they are corrupted or damaged. In particular, recovery applies to objects found to be damaged during the normal queue manager startup. If any transaction was incomplete when the queue manager last shut down, any queue affected is also recovered automatically in order to complete the startup operation.

You must recover other objects manually, using the rcrmqobj command, which replays the records in the log to re-create the IBM MQ object. The object is re-created from its latest image found in the log, together with all applicable log events between the time the image was saved and the time the re-create command was issued. If an IBM MQ object becomes damaged, the only valid actions that can be performed are either to delete it or to re-create it by this method. Nonpersistent messages cannot be recovered in this way.

See rcrmqobj for further details of the rcrmqobj command.
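
When several objects are reported as damaged, it can be convenient to generate one rcrmqobj command per object. The following Python sketch does that using the -m and -t flags; the queue manager name QM1 and the object names are hypothetical, and the commands are only printed, not executed.

```python
import shlex


def rcrmqobj_commands(qmgr, damaged):
    """Build one rcrmqobj argument list per (object_type, name) pair."""
    return [
        ["rcrmqobj", "-m", qmgr, "-t", object_type, name]
        for object_type, name in damaged
    ]


# Hypothetical damaged objects, for illustration only.
damaged_objects = [("queue", "ORDERS.IN"), ("channel", "TO.QM2")]
for command in rcrmqobj_commands("QM1", damaged_objects):
    print(shlex.join(command))
```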

The log file containing the media recovery record, and all subsequent log files, must be available in the log file directory when attempting media recovery of an object. If a required file cannot be found, operator message AMQ6767 is issued and the media recovery operation fails. If you do not take regular media images of the objects that you might need to re-create, you might have insufficient disk space to hold all the log files required to re-create an object.
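
Before attempting media recovery, you could check that no required log file is missing. The following Python sketch assumes that log extents are named S followed by a seven-digit sequence number and .LOG (for example S0000012.LOG), that the numbers are consecutive, and that the sequence has not wrapped; the function names and the file list are hypothetical.

```python
import re

# Assumed extent naming convention: S0000000.LOG, S0000001.LOG, ...
EXTENT_NAME = re.compile(r"S(\d{7})\.LOG")


def extent_number(name):
    """Extract the sequence number from an extent file name."""
    match = EXTENT_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected extent name: {name}")
    return int(match.group(1))


def missing_extents(media_extent, current_extent, present_files):
    """List extents from the media recovery record onward that are absent."""
    first = extent_number(media_extent)
    last = extent_number(current_extent)
    present = set(present_files)
    needed = (f"S{number:07d}.LOG" for number in range(first, last + 1))
    return [name for name in needed if name not in present]


# Hypothetical directory listing with one extent archived and removed.
files = ["S0000012.LOG", "S0000014.LOG", "S0000015.LOG"]
print(missing_extents("S0000012.LOG", "S0000015.LOG", files))
# ['S0000013.LOG']
```

Any extent reported by such a check would need to be restored to the log file directory before recovery is attempted.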

What object files exist

The queue manager stores the attributes of objects that are defined in runmqsc in files on disk. These object files are in subdirectories under the data directory of the queue manager.

For example, on UNIX and Linux platforms, channels are stored in /var/mqm/qmgrs/qmgr/channel.

The data in these object files is the media image of the objects. If these object files get deleted or corrupted, the object stored in that file is damaged. Using a linear logging queue manager, damaged objects can be recovered from the log using the rcrmqobj command.

Most object files contain just the attributes of the object; for example, channel files contain the attributes of channels. The exceptions are:

Catalog
The object catalog catalogs all the objects of all types and is stored in qmanager/QMQMOBJCAT.

Syncfiles
The syncfile contains internal state data associated with all channels.

Queues
Queue files contain both the messages on that queue as well as the attributes of that queue.

Note that there is no catalog or syncfile object exposed in runmqsc or IBM MQ Explorer.

The catalog and the queue manager can be recorded, but not recovered. If these objects get damaged, the queue manager ends pre-emptively and the objects are recovered automatically on restart.

Subscriptions are not listed in objects to record or recover, because durable subscriptions are stored on a system queue. To record or recover durable subscriptions, record or recover the SYSTEM.DURABLE.SUBSCRIBER.QUEUE instead.

Recovering damaged objects during startup

If the queue manager discovers a damaged object during startup, the action it takes depends on the type of object and whether the queue manager is configured to support media recovery.

If the queue manager object is damaged, the queue manager cannot start unless it can recover the object. If the queue manager is configured with a linear log, and thus supports media recovery, IBM MQ automatically tries to re-create the queue manager object from its media images. If the log method selected does not support media recovery, you can either restore a backup of the queue manager or delete the queue manager.

If any transactions were active when the queue manager stopped, the local queues containing the persistent, uncommitted messages put or got inside these transactions are also required to start the queue manager successfully. If any of these local queues is found to be damaged, and the queue manager supports media recovery, it automatically tries to re-create them from their media images. If any of the queues cannot be recovered, IBM MQ cannot start.

If any damaged local queues containing uncommitted messages are discovered during startup processing on a queue manager that does not support media recovery, the queues are marked as damaged objects and the uncommitted messages on them are ignored. This is because it is not possible to perform media recovery of damaged objects on such a queue manager, and the only action left is to delete them. Message AMQ7472 is issued to report any damage.

Recovering damaged objects at other times

Media recovery of objects is automatic only during startup. At other times, when object damage is detected, operator message AMQ7472 is issued and most operations using the object fail. If the queue manager object is damaged at any time after the queue manager has started, the queue manager performs a pre-emptive shutdown. When an object has been damaged, you can delete it or, if the queue manager is using a linear log, attempt to recover it from its media image using the rcrmqobj command (see rcrmqobj for further details).

If a queue (or other object) gets damaged, MEDIALOG will not move forward. This is because MEDIALOG is the oldest extent required for media recovery. If your workload is continuing, CURRLOG will still be moving forward and so new extents will be written. Depending on your configuration (including your LogManagement setting), this might start filling your log filesystem. If the log filesystem fills completely, transactions get rolled back, and the queue manager might end abruptly. So when a queue gets damaged, you might have only a limited amount of time to act before your queue manager ends. How much time you have depends on the rate at which your workload is causing the queue manager to write new extents, and on the amount of free space you have in your log filesystem.
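
The time available can be estimated with simple arithmetic. The following Python sketch assumes a constant rate of extent creation, a fixed extent size, and that no extents are freed in the meantime; the function name and the figures used are hypothetical, so treat the result as a rough guide only.

```python
def minutes_until_full(free_bytes, extent_bytes, extents_per_minute):
    """Estimate how long until the log filesystem cannot hold a new extent."""
    if extents_per_minute <= 0:
        return float("inf")  # no new extents are being written
    # Only whole extents fit in the remaining free space.
    extents_that_fit = free_bytes // extent_bytes
    return extents_that_fit / extents_per_minute


# Hypothetical figures: 2 GiB free, 64 MiB extents, one new extent
# every two minutes.
free = 2 * 1024**3
extent = 64 * 1024**2
print(minutes_until_full(free, extent, 0.5))  # 64.0
```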

If you are using manual log management, you might be archiving extents not needed for restart recovery, and then deleting them from the log filesystem, even though they are still needed for media recovery. This is acceptable as long as you can restore them from your archive when needed. This policy does not cause your log filesystem to fill when a queue gets damaged and MEDIALOG stops moving forward. However, if you archive and delete only those extents that are not needed for either restart or media recovery, your log filesystem starts to fill if a queue gets damaged.

If you are using automatic or archive log management, the queue manager will not reuse extents that are still needed for media recovery, even though you might have archived them and notified the queue manager using SET LOG ARCHIVED. Consequently, if a queue gets damaged, your log filesystem will start filling.

If a queue gets damaged, OBJECT DAMAGED FFDCs are written and MEDIALOG stops moving forward. The damaged object can be identified from the FFDC, or because it is the object with the oldest MEDIALOG when you display its status in runmqsc.

If your log filesystem is filling, and you are concerned that your workload will be backed out because the log filesystem is becoming full, recovering the object or quiescing your workload might prevent this.
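
Once you have collected the MEDIALOG value reported for each queue, finding the oldest one is a simple comparison. The following Python sketch assumes that extent names are fixed width with zero-padded sequence numbers (for example S0000017.LOG), so that they sort in age order as plain strings, and that queues without a media image report an empty value; the queue names and values are hypothetical.

```python
def oldest_medialog(queue_medialog):
    """Return the queue whose MEDIALOG extent is the oldest, or None."""
    # Ignore queues that report no media recovery extent.
    candidates = {
        queue: extent for queue, extent in queue_medialog.items() if extent
    }
    if not candidates:
        return None
    # Zero-padded names of equal length sort in sequence order.
    return min(candidates, key=lambda queue: candidates[queue])


# Hypothetical status values gathered from the queue manager.
status = {
    "ORDERS.IN": "S0000042.LOG",
    "ORDERS.OUT": "S0000017.LOG",
    "AUDIT": "S0000041.LOG",
    "SCRATCH": "",
}
print(oldest_medialog(status))  # ORDERS.OUT
```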
