Peer restart and recovery

(ZOS) Peer restart and recovery

The goal of every system is to have as little downtime as possible. Sometimes, however, system failures are inevitable. For example, a system failure might occur because the power unexpectedly goes out on your main system. When a system failure occurs, one available action is to restart on a peer system in the sysplex. This type of restart uses the peer restart and recovery function. Starting a server on a system for which it was not configured implicitly places it into peer restart and recovery mode.

Deprecated feature: Peer restart and recovery (PRR) functionality is deprecated. Use the integrated high availability support for the transaction service subcomponent, instead of peer restart and recovery, for transaction recovery.

When a main system failure leaves in-doubt transactions with unknown outcomes, the intended outcomes of those transactions must be obtained before the data can be used again. Peer restart and recovery provides an automated means of accomplishing this by restarting the controller on a peer system, so that the outcomes can be determined and the locks that block the data can be dropped. This contrasts with how a system usually handles a failure, which is to roll back automatically.

If a failure occurs, automatic restart management:

Can restart the product and related servers on the same system, or
Can use the peer restart and recovery function to restart related servers on an alternate system in the cell.

The server is not a recoverable resource manager; it is a recoverable communication manager. It has no recoverable locks of its own, so it does not need to manage locks or manage lock states in a log. It only needs to make sure that callers and callees are connected in each communications session of a distributed transaction.

Peer restart and recovery restarts the controller on another system and goes through the transaction restart and recovery process so that we can assign outcomes to transactions that were in progress at the time of failure. During this transaction restart and recovery process, data might be temporarily inaccessible until the recovery process is complete. The restart and recovery process does not result in lost data.

Resource managers, such as DB2, that were being accessed at the time of failure might hold locks that are scoped to a transaction unit of recovery (UR). Once an outcome has been assigned to a UR, the resource managers generally drop those locks.
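
The relationship between URs, outcomes, and locks described above can be illustrated with a small toy model. This is only a sketch for intuition, not how DB2 or the product implements locking; the class ToyResourceManager, its methods, and the key name "acct:42" are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    IN_DOUBT = "in-doubt"
    COMMITTED = "committed"
    ROLLED_BACK = "rolled-back"


class ToyResourceManager:
    """Hypothetical resource manager whose locks are scoped to a UR."""

    def __init__(self):
        self.locks = {}     # UR id -> set of keys locked by that UR
        self.outcomes = {}  # UR id -> Outcome

    def hold_in_doubt(self, ur_id, keys):
        # Model the state left behind by a failure: the UR has no
        # outcome yet, so its locks stay in place.
        self.locks[ur_id] = set(keys)
        self.outcomes[ur_id] = Outcome.IN_DOUBT

    def is_accessible(self, key):
        # Data is accessible only if no UR still holds a lock on it.
        return all(key not in held for held in self.locks.values())

    def assign_outcome(self, ur_id, outcome):
        # Once recovery assigns an outcome, the UR's locks are dropped.
        self.outcomes[ur_id] = outcome
        self.locks.pop(ur_id, None)


rm = ToyResourceManager()
rm.hold_in_doubt("UR1", ["acct:42"])
blocked_before = not rm.is_accessible("acct:42")
rm.assign_outcome("UR1", Outcome.COMMITTED)
accessible_after = rm.is_accessible("acct:42")
```

In this model the data stays blocked for exactly as long as the UR is in doubt, which mirrors why data might be temporarily inaccessible until recovery completes.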

Subtopics

When might PRR fail to recover servers
The main cause of peer restart and recovery (PRR) failure is a network outage during recovery. If the system cannot reach the superior or subordinate because the network is down, communication cannot be reestablished and the transaction cannot be completely resolved.
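
The dependency on network reachability can be sketched as a simple recovery pass. This is a simplified illustration under stated assumptions, not the product's recovery logic: the function resolve_in_doubt, the is_reachable callback, and the partner names are hypothetical.

```python
def resolve_in_doubt(transactions, is_reachable):
    """Run one toy recovery pass.

    transactions: dict mapping a transaction id to the list of partners
        (its superior or subordinates) that must be contacted.
    is_reachable: callable that takes a partner name and returns True
        if the network path to that partner is up.

    Returns a (resolved, still_in_doubt) pair of transaction id lists.
    """
    resolved, still_in_doubt = [], []
    for tx_id, partners in transactions.items():
        # A transaction resolves only if every partner can be reached.
        if all(is_reachable(p) for p in partners):
            resolved.append(tx_id)
        else:
            still_in_doubt.append(tx_id)
    return resolved, still_in_doubt


# Hypothetical scenario: the path to "subordinate-B" is down.
txs = {
    "tx1": ["superior-A"],
    "tx2": ["superior-A", "subordinate-B"],
}
resolved, pending = resolve_in_doubt(txs, lambda p: p != "subordinate-B")
```

Here tx2 stays in doubt until the unreachable partner can be contacted again, so a later pass would be needed once the network is restored.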

Related:

Transactional high availability
Configure transaction properties for peer recovery
Repository service custom properties