Retry interval

Workload distribution policy in the configuration file. | Connection time-out settings to bypass the operating system time out.

Retry interval

+
Search Tips | Advanced Search

Overview

There is a setting in the plug-in configuration file that allows you to specify how long to wait before retrying a server that is marked as down. This is useful in avoiding unnecessary attempts when you know that server is unavailable. The default is 60 seconds.
This setting is specified in the configuration of each Web server, as shown in Figure 18-2, on the Retry interval field. This default setting means that if a cluster member was marked as down, the plug-in would not retry it for 60 seconds. To change this value, go to...
Servers | Web servers | WebServer_Name | Plug-in properties | Request Routing

Finding the correct setting

There is no way to recommend one specific value. The value chosen depends on your environment, for example, on the number of cluster members in your configuration.
Setting the retry interval to a small value allows an application server that becomes available to quickly begin serving requests. However, too small of a value can cause serious performance degradation, or even cause your plug-in to appear to stop serving requests, particularly in a machine outage situation. For example, if you have numerous cluster members, and one cluster member being unavailable does not affect the performance of your application, then you can safely set the value to a very high number.
Alternatively, if your optimum load has been calculated assuming all cluster members to be available or if you do not have very many, then you will want your cluster members to be retried more often to maintain the load.
Also, take into consideration the time it takes to restart your server. If a server takes a long time to boot up and load applications, then you will need a longer retry interval.
Another factor to consider for finding the correct retry interval for your environment is the operating system TCP/IP timeout value. To explain the relationship between these two values, let us look at an example configuration with two machines, which we call A and B. Each of these machines is running two clustered application servers (CM1 and CM2 on A, CM3 and CM4 on B). The HTTP server and plug-in are running on AIX with a TCP timeout of 75 seconds, the retry interval is set to 60 seconds, and the routing algorithm is weighted round robin. If machine A fails, either as expected or unexpectedly, the following process occurs when a request comes in to the plug-in:

The plug-in accepts the request from the HTTP server and determines the server cluster.
The plug-in determines that the request should be routed to cluster member CM1 on system A.
The plug-in attempts to connect to CM1 on machine A. Because the physical machine is down, the plug-in waits 75 seconds for the operating system TCP/IP timeout interval before determining that CM1 is unavailable.
The plug-in attempts to route the same request to the next cluster member in its routing algorithm, CM2 on machine A. Because machine A is still down, the plug-in must again wait 75 seconds for the operating system TCP/IP timeout interval before determining that CM2 is also unavailable.
The plug-in attempts to route the same request to the next cluster member in its routing algorithm, CM3 on system B. This application server successfully returns a response to the client, about 150 seconds after the request was first submitted.
While the plug-in was waiting for the response from CM2 on system A, the 60-second retry interval for CM1 on system A expired, and the cluster member is added back into the routing algorithm. A new request is routed to this cluster member, which is still unavailable, and this lengthy waiting process will begin again.

There are two options to avoid this problem:

The recommended approach is to configure your application servers to use a non-blocking connection. This eliminates the impact of the operating system TCP/IP timeout. See Connection timeout for information.
An alternative is to set the retry interval to a more conservative value than the default of 60 seconds, related to the number of cluster members in your configuration. A good starting point is 10 seconds + (number of cluster members * TCP timeout). This ensures that the plug-in does not get stuck in a situation of constantly trying to route requests to the failed members. In the scenario described before, this setting would cause the two cluster members on system B to exclusively service requests for 235 seconds before the cluster members on system A are retried, resulting in another 150-second wait.