Coldstart: What to do if log extents are missing or corrupt
If your enterprise loses some or all of the log extents needed for restart recovery, the queue manager cannot replay the recovery log and so fails to restart. If you need the queue manager to restart even though the recovery log is missing or corrupt, at the expense of data integrity, you can force it to do so, although this is strongly discouraged. This process is known as coldstarting a queue manager.
The effects of coldstart
On coldstart, the queue manager creates an empty recovery log and relies on the data in the queue files and other object files in their existing state. Because the data in the queue files can be inconsistent, messages might be lost, duplicated, corrupted, or inconsistent.
The queue manager records the configuration of all other persistent objects in the recovery log as well as in the object files. Internal state data is also recorded in the recovery log, so on coldstart that state is reset and the configuration data for these objects might be inaccurate.
The effects of coldstart are unpredictable and wide-ranging, so avoid a coldstart unless it is absolutely necessary. After a coldstart, the information in the queue and object files can be so inconsistent that the queue manager does not restart at all.
If the queue manager does restart, there is no simple way of discovering what message data or configuration can be relied on and what cannot. Also, after a coldstart, queues might be damaged and so become completely unusable.
Additionally, even if you can get messages from, or put messages to, a particular queue, the messages on it might be corrupt, missing, or duplicated. Transactions and channels might be stuck in-doubt. Even if your queue manager coldstarts successfully and the queues look intact, the unpredictable effects of the coldstart might not become apparent until much later.
Important: IBM does not support any enterprise running on a queue manager that was previously coldstarted.
What to do if you need to coldstart
Because IBM does not support any enterprise running on a queue manager that was coldstarted, you are strongly discouraged from carrying out a coldstart. However, if you definitely need to coldstart a queue manager, contact IBM MQ Support.
The process for coldstarting a queue manager used to be much more complicated for a queue manager using linear logging than for one using circular logging. From IBM MQ Version 9.1.3, the coldstart process is much simpler and no longer involves copying or renaming log extents.
From IBM MQ Version 9.1.3, contact IBM Support, who will give you a key that you pass to the strmqm command to coldstart a queue manager.
Attention: The IBM MQ Version 9.1.3 coldstart command still carries the same risks of losing data integrity, and the queue manager is still not fully supported once it has been coldstarted.
Eliminating future coldstarts: a request
The strmqm command requires a key to coldstart a queue manager because IBM wants you to contact IBM MQ Support whenever you need to coldstart, so that IBM can understand how you got into this situation.
Clearly, coldstart is best avoided. IBM has gone to considerable effort to make sure that you never need to coldstart your queue manager, and is keen to discover whether there is anything more the product can do to remove the need for a coldstart.
Precautions to avoid a coldstart
The default logging method when creating a queue manager is circular logging. With circular logging, you give the queue manager a fixed number of primary and secondary log extents of a given size. Make your log filesystem large enough to contain all the primary and secondary log extents, and you should never need to administer them.
Alternatively, you can use linear logging instead of circular logging. Linear logging gives you the added ability to recover queues and other objects in the unlikely event that they become damaged. However, by default, linear logging requires you to delete log extents that are no longer needed for restart or media recovery. This is referred to as manual log management.
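Sizing the log filesystem for circular logging is simple arithmetic: the space needed is the number of primary plus secondary extents, multiplied by the extent size. The sketch below assumes that the extent size is expressed as a number of 4 KB log file pages (the LogFilePages setting). It also adds a hypothetical 20% headroom for control files and growth. The margin is an illustrative assumption, not an IBM recommendation.

```python
# Hypothetical sizing helper for a circular recovery log.
# Assumption: the extent size is given as a count of 4 KB log file pages.
LOG_PAGE_BYTES = 4096


def circular_log_bytes(primary: int, secondary: int, log_file_pages: int) -> int:
    """Bytes needed to hold every primary and secondary extent at the same time."""
    if primary < 1 or secondary < 0 or log_file_pages < 1:
        raise ValueError("need primary >= 1, secondary >= 0, log_file_pages >= 1")
    return (primary + secondary) * log_file_pages * LOG_PAGE_BYTES


def log_filesystem_bytes(primary: int, secondary: int, log_file_pages: int,
                         headroom_percent: int = 20) -> int:
    """Extent space plus a hypothetical safety margin, using integer arithmetic."""
    extents = circular_log_bytes(primary, secondary, log_file_pages)
    return extents * (100 + headroom_percent) // 100


if __name__ == "__main__":
    # Illustrative values only: 3 primary, 2 secondary, 4096 pages per extent.
    needed = log_filesystem_bytes(3, 2, 4096)
    print(f"{needed / 2**20:.0f} MiB")  # prints "96 MiB"
```

With these illustrative values, the extents alone take 5 × 4096 × 4 KB = 80 MiB, and the margin brings the filesystem size to 96 MiB.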
When administering log extents in this way, it is possible to inadvertently delete too many log extents and so end up having to coldstart. To mitigate this risk, use automatic log management, so the queue manager manages log extents on your behalf.
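With manual log management, an extent can be deleted only if it is older than both the oldest extent still needed for restart recovery and the oldest extent still needed for media recovery. The queue manager reports both of these. The sketch below takes them as plain inputs and only lists candidates; it deletes nothing. It assumes that linear log extents are named `S` followed by seven digits and `.LOG` (for example `S0000042.LOG`). The function names are hypothetical.

```python
import re

# Assumption: extent names look like S0000042.LOG (S + seven digits + .LOG).
EXTENT_NAME = re.compile(r"^S(\d{7})\.LOG$")


def extent_number(name: str) -> int:
    """Sequence number encoded in an extent file name."""
    match = EXTENT_NAME.match(name)
    if match is None:
        raise ValueError(f"not a log extent name: {name!r}")
    return int(match.group(1))


def extents_safe_to_delete(names, oldest_for_restart: str, oldest_for_media: str):
    """List extents older than BOTH the restart and the media recovery extent.

    Anything at or after the older of the two is still needed and is never
    returned. Files that are not extents are ignored. Nothing is deleted.
    """
    cutoff = min(extent_number(oldest_for_restart), extent_number(oldest_for_media))
    extents = [n for n in names if EXTENT_NAME.match(n)]
    return sorted((n for n in extents if extent_number(n) < cutoff), key=extent_number)


if __name__ == "__main__":
    names = ["S0000001.LOG", "S0000002.LOG", "S0000003.LOG",
             "S0000004.LOG", "S0000005.LOG", "amqhlctl.lfh"]
    # Media recovery still needs S0000003.LOG, so only the two older extents qualify.
    print(extents_safe_to_delete(names, "S0000005.LOG", "S0000003.LOG"))
```

Taking the minimum of the two cutoffs is the key point. Deleting relative to the restart extent alone would remove extents that media recovery still needs. Automatic log management removes the need for this logic entirely.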
The best practice is to put your recovery log in a separate log filesystem that contains only the recovery log. If you put your recovery log in the same filesystem as the rest of your queue manager data, you might find that the filesystem accidentally fills up, perhaps because of large queue files. Either make the log directory for the queue manager a separate filesystem, or specify a log path on a different filesystem using the -ld option of the crtmqm command.
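One way to check this layout is to ask whether the log directory and the queue manager data directory report the same device ID. If they do, they share a filesystem. This is a minimal sketch using `os.stat`. The example paths are assumptions based on typical Linux/UNIX default locations for a hypothetical queue manager named QM1, so substitute your own.

```python
import os


def on_separate_filesystem(log_dir: str, data_dir: str) -> bool:
    """True if the two directories live on different mounted filesystems."""
    return os.stat(log_dir).st_dev != os.stat(data_dir).st_dev


if __name__ == "__main__":
    # Hypothetical paths for a queue manager named QM1; adjust for your system.
    log_dir, data_dir = "/var/mqm/log/QM1", "/var/mqm/qmgrs/QM1"
    if os.path.isdir(log_dir) and os.path.isdir(data_dir):
        print("log on separate filesystem:", on_separate_filesystem(log_dir, data_dir))
    else:
        print("directories not found; edit the paths above")
```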
If the filesystem holding the queue files fills, you might not be able to put to those queues, but the queue manager continues running. If the filesystem containing the recovery log fills, the queue manager ends abruptly and will not restart until you free up some space.
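Because a full log filesystem stops the queue manager, it can be worth monitoring free space there separately from the queue file filesystem. The sketch below uses Python's `shutil.disk_usage`. The threshold you pass is a hypothetical value for illustration, not an IBM recommendation.

```python
import shutil


def log_filesystem_free_percent(log_dir: str) -> float:
    """Percentage of the filesystem holding log_dir that is still free."""
    usage = shutil.disk_usage(log_dir)
    return 100.0 * usage.free / usage.total


def log_filesystem_low(log_dir: str, min_free_bytes: int) -> bool:
    """True when free space on the log filesystem is below min_free_bytes."""
    return shutil.disk_usage(log_dir).free < min_free_bytes


if __name__ == "__main__":
    log_dir = "."  # hypothetical: substitute the queue manager's log directory
    threshold = 512 * 2**20  # hypothetical alert threshold of 512 MiB
    if log_filesystem_low(log_dir, threshold):
        print(f"WARNING: log filesystem low ({log_filesystem_free_percent(log_dir):.1f}% free)")
    else:
        print(f"log filesystem OK ({log_filesystem_free_percent(log_dir):.1f}% free)")
```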
Be careful not to delete log extents needed for restart recovery; otherwise, you might find yourself needing to coldstart. You might also need to coldstart because the disk that contains the recovery log has failed. Best practice is to put the recovery log on a replicated disk, which mitigates the risk of a disk failure.
Moving your messages and configuration to a new replacement queue manager avoids the possibility of ongoing problems with a queue manager that has been previously coldstarted, and reinstates full support.
Keep a note of which queue managers have been previously coldstarted, even if they were coldstarted a long time ago and have been stopped, restarted, and migrated in the meantime. When you contact IBM Support, say whether the queue manager has been previously coldstarted and, if so, give as much information as possible about what caused the need for a coldstart.