Transfer recovery timeout concepts

We can set the amount of time, in seconds, during which a source agent keeps trying to recover a stalled file transfer. If the transfer is not successful when the agent reaches the timeout for the retry interval, the transfer fails.

Recovery timeout precedence

A transfer recovery timeout value for an individual transfer specified through the fteCreateTransfer, fteCreateTemplate, or fteCreateMonitor commands, or by using IBM MQ Explorer, or specified in the fte:filespec nested element, takes precedence over the value that is specified for the transferRecoveryTimeout parameter in the agent.properties file for the source agent.

For example, if the fteCreateTransfer command is started without the -rt parameter and value pair, the source agent AGENT1 checks the agent.properties file for a transferRecoveryTimeout value to determine the recovery timeout behavior:

fteCreateTransfer -sa AGENT1 -da AGENT2 -df C:\import\transferredfile.txt C:\export\originalfile.txt

If the transferRecoveryTimeout parameter in the agent.properties file is either not set or is set to -1, the agent follows the default behavior and tries to recover the transfer until it is successful. However, if the fteCreateTransfer command includes the -rt parameter, the value of this parameter takes precedence over the value in the agent.properties file and is used as the recovery timeout setting for the transfer:

fteCreateTransfer -sa AGENT1 -da AGENT2 -rt 21600 -df C:\import\transferredfile.txt C:\export\originalfile.txt
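
The precedence rule described above can be sketched as a small lookup. This is an illustrative model only, not agent code: the function name effective_recovery_timeout and its two parameters are hypothetical, and the sketch assumes that an unset value is represented as None.

```python
from typing import Optional

RETRY_FOREVER = -1  # per the text: unset or -1 means keep recovering until success


def effective_recovery_timeout(command_rt: Optional[int],
                               agent_property: Optional[int]) -> int:
    """Return the recovery timeout, in seconds, that applies to one transfer.

    command_rt     -- value given for this transfer (for example with -rt), or None
    agent_property -- transferRecoveryTimeout from agent.properties, or None if unset
    """
    if command_rt is not None:
        # A per-transfer value takes precedence over agent.properties.
        return command_rt
    if agent_property is not None:
        return agent_property
    # Neither is set: default behavior, recover until the transfer succeeds.
    return RETRY_FOREVER


print(effective_recovery_timeout(None, None))    # -1
print(effective_recovery_timeout(None, 3600))    # 3600
print(effective_recovery_timeout(21600, 3600))   # 21600
```

In the last call the per-transfer value of 21600 seconds is used even though the hypothetical agent.properties value is 3600.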

Recovery timeout counter

The recovery timeout counter starts when the transfer enters the recovering state. A transfer log message is published to the SYSTEM.FTE topic with the topic string Log/agent_name/transfer_ID to indicate that the transfer status has changed to recovering, together with the source agent clock time at which the status changed. If the transfer resumes within the set retry interval, before the counter reaches the recovery timeout (counter < recovery timeout), the counter is reset to 0, ready to start again if the transfer re-enters recovery.

If the counter reaches the maximum value set for the recovery timeout (counter==recovery timeout), the recovery of the transfer stops and the source agent reports the transfer as failed. This type of transfer failure, caused by the transfer reaching the recovery timeout, is indicated by the message code RECOVERY TIMEOUT (69). Another transfer log message is published to the SYSTEM.FTE topic, with a topic string of Log/agent_name/transfer_ID, to indicate that the transfer has failed; it includes a message, the return code, and the source agent's event log.

The source agent's event log is updated with a message when any of the following events occur during recovery:

When a transfer whose recovery timeout parameter is set to a value greater than -1 enters recovery, the agent's event log is updated to indicate the start of the recovery timer for the TransferId and the amount of time the source agent waits before it initiates the recovery timeout processing.
When the recovering transfer is resumed, the source agent's event log is updated with a new message to indicate that the TransferId that was in recovery is resumed.
When a recovering transfer has timed out, the source agent's event log is updated to indicate the TransferId that failed while recovering due to recovery timeout.

These log messages enable the users (subscribers and loggers) to identify the transfers that failed due to the transfer recovery timeout.
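
The counter behavior described in this section (start on entering recovery, reset to 0 on resume, fail when the counter reaches the timeout) can be modeled as a small state holder. This is a minimal sketch under stated assumptions: the class RecoveryTimer and its method names are hypothetical, time is passed in explicitly rather than read from a clock, and nothing here reflects how the agent is actually implemented.

```python
from dataclasses import dataclass
from typing import Optional

RECOVERY_TIMEOUT_RC = 69  # return code the text gives for RECOVERY TIMEOUT


@dataclass
class RecoveryTimer:
    """Hypothetical model of the source agent's recovery timeout counter."""
    timeout_s: int                      # recovery timeout in seconds; -1 means no timeout
    started_at: Optional[float] = None  # time recovery began, None when not recovering

    def enter_recovery(self, now: float) -> None:
        # The counter starts when the transfer enters the recovering state.
        if self.started_at is None:
            self.started_at = now

    def resume(self) -> None:
        # A resumed transfer resets the counter to 0 for any later recovery.
        self.started_at = None

    def counter(self, now: float) -> float:
        return 0.0 if self.started_at is None else now - self.started_at

    def timed_out(self, now: float) -> bool:
        # With -1 the agent keeps trying, so the timer never expires.
        if self.timeout_s < 0 or self.started_at is None:
            return False
        return self.counter(now) >= self.timeout_s


timer = RecoveryTimer(timeout_s=600)
timer.enter_recovery(now=0.0)
print(timer.timed_out(now=300.0))   # False: still inside the timeout
timer.resume()
print(timer.counter(now=300.0))     # 0.0: counter reset after resume
timer.enter_recovery(now=1000.0)
print(timer.timed_out(now=1600.0))  # True: counter reached the timeout
```

When timed_out returns True, the model corresponds to the point at which the source agent would report the transfer as failed with return code 69.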

The counter for the recovery timeout is always at the source agent. However, if the destination agent fails to receive information from the source agent in a timely manner, it can send a request to the source agent to put the transfer in recovery. For a transfer where the recovery timeout option is set, the source agent starts the recovery timeout counter when it receives the request from the destination agent.

Manual handling is still required for transfers that do not use the recovery timeout option, and for failed and partially complete transfers.

For transfer sets, where a single transfer request is issued for multiple files, and some of the files completed successfully but one completed only partially, the transfer is still marked as failed as it did not complete as expected. The source agent might have timed out while transferring the partially completed file.

Ensure that the destination agent and file server are ready and in a state to accept file transfers.

We have to issue the transfer request again for the entire set. Because some of the files from the initial transfer attempt remain on the destination, we can avoid problems by issuing the new request with the overwrite if existing option specified. This ensures that the incomplete set of files from the previous transfer attempt is cleaned up as part of the new transfer, before the files are written to the destination again.
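
As an illustration of reissuing a request with overwrite enabled, the following sketch assembles a command line for a single file. The helper build_resubmit_command is hypothetical, and it assumes that the overwrite if existing option corresponds to the -de overwrite parameter of fteCreateTransfer; confirm this against the command reference for your version before relying on it. The sketch only builds and prints the arguments; it does not run the command.

```python
def build_resubmit_command(source_agent, dest_agent, dest_file, source_file,
                           recovery_timeout=None):
    """Return the argument list for reissuing a transfer with overwrite enabled."""
    args = ["fteCreateTransfer", "-sa", source_agent, "-da", dest_agent]
    if recovery_timeout is not None:
        args += ["-rt", str(recovery_timeout)]
    # Assumption: -de overwrite is the command-line form of "overwrite if existing".
    args += ["-de", "overwrite", "-df", dest_file, source_file]
    return args


command = build_resubmit_command(
    "AGENT1", "AGENT2",
    r"C:\import\transferredfile.txt", r"C:\export\originalfile.txt",
    recovery_timeout=21600,
)
print(" ".join(command))
```

The agent names, paths, and timeout reuse the values from the earlier examples in this topic.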

From Version 9.1.5, it is no longer necessary to manually remove part files left on a destination after an initial transfer attempt has failed. If a transfer recovery timeout is set for a transfer, the source agent moves the transfer into the RecoveryTimedOut state if transfer recovery times out. After the transfer has been resynchronized, the destination agent removes any part files that were created during the transfer and sends a completion message to the source agent.

Traces and messages

Tracing points are included for diagnostic purposes. The recovery timeout value, start of the retry interval, start of the resume period and counter reset, and whether the transfer timed out and failed, are logged. In case of a problem or unexpected behavior, we can collect the source agent output log and trace files, and provide them when requested by IBM support, to help with troubleshooting.

Messages notify you when:

A transfer enters recovery (BFGTR0081I)
A transfer is terminated because it timed out from recovery (BFGSS0081E)
A transfer resumes after being in recovery (BFGTR0082I)
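
To pick these messages out of collected log output, a simple scan for the three message IDs is often enough. The sketch below is an assumption-laden example: the function recovery_events is hypothetical, the sample lines are invented placeholders, and only the message IDs themselves come from this topic. It makes no assumption about the log line format beyond the ID appearing somewhere in the line.

```python
import re

# Message IDs listed in this topic and what each one signals.
RECOVERY_MESSAGES = {
    "BFGTR0081I": "entered recovery",
    "BFGTR0082I": "resumed after recovery",
    "BFGSS0081E": "terminated by recovery timeout",
}

MESSAGE_ID = re.compile("|".join(RECOVERY_MESSAGES))


def recovery_events(lines):
    """Yield (message_id, meaning, line) for each line containing a listed ID."""
    for line in lines:
        match = MESSAGE_ID.search(line)
        if match:
            yield match.group(0), RECOVERY_MESSAGES[match.group(0)], line.rstrip()


# Hypothetical log lines: only the message IDs come from this topic.
sample = [
    "BFGTR0081I: <placeholder text>",
    "some unrelated line",
    "BFGSS0081E: <placeholder text>",
]
for message_id, meaning, _ in recovery_events(sample):
    print(message_id, meaning)
```

In practice the lines argument could be an open file handle on the source agent output log.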
