RetryInterval and operating system TCP timeout

If a request to an application server in a cluster fails, and there are other application servers in the cluster, the plug-in transparently reroutes the failed request to the next application server in the routing algorithm. The unresponsive application server is marked unavailable, and all new requests are routed to the other application servers in the server cluster.
The amount of time the application server remains unavailable after a failure is configured by the RetryInterval attribute of the <ServerCluster> element. If this attribute is not present, the default value is 60 seconds.
<ServerCluster Name="HACluster" RetryInterval="600">
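The default rule above can be checked with a short script. The following is a minimal sketch, not part of the plug-in: the helper name retry_interval is hypothetical, and the XML fragments are self-closed only so that they parse on their own.

```python
import xml.etree.ElementTree as ET

# Default stated in the text: 60 seconds when RetryInterval is not present.
DEFAULT_RETRY_INTERVAL = 60


def retry_interval(cluster_xml: str) -> int:
    """Return the RetryInterval of a <ServerCluster> fragment, or the default.

    Hypothetical helper for illustration; the plug-in itself does not expose it.
    """
    element = ET.fromstring(cluster_xml)
    return int(element.get("RetryInterval", DEFAULT_RETRY_INTERVAL))


print(retry_interval('<ServerCluster Name="HACluster" RetryInterval="600"/>'))
print(retry_interval('<ServerCluster Name="HACluster"/>'))
```

The first call reports the configured 600 seconds; the second falls back to the 60-second default.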
The failover behavior is shown in Example 9-3.
Example 9-3 Marked down appserver

serverGroupCheckServerStatus: Server WebHAbbMember4 is marked down; retry in 598
serverGroupCheckServerStatus: Checking status of WebHAbbMember4, ignoreWeights 0, markedDown 1, retryNow 0, wlbAllows 1
...
serverGroupCheckServerStatus: Server WebHAbbMember4 is marked down; retry in 579
serverGroupCheckServerStatus: Checking status of WebHAbbMember4, ignoreWeights 0, markedDown 1, retryNow 0, wlbAllows 1
...
serverGroupCheckServerStatus: Server WebHAbbMember4 is marked down; retry in 6
serverGroupCheckServerStatus: Checking status of WebHAbbMember4, ignoreWeights 0, markedDown 1, retryNow 0, wlbAllows 1
...
serverGroupCheckServerStatus: Server WebHAbbMember4 is marked down; retry in 0
serverGroupCheckServerStatus: Checking status of WebHAbbMember4, ignoreWeights 0, markedDown 1, retryNow 1, wlbAllows 1

When the RetryInterval expires, the plug-in will add the appserver back into the routing algorithm and attempt to send a request to it. If the request fails or times out, the application server is again marked unavailable for the length of the RetryInterval.
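The mark-down and retry cycle can be modeled in a few lines. This is a simplified sketch of the behavior described above, not the plug-in's actual implementation: the Member and RetryTracker classes are hypothetical, and time is passed in as whole seconds so the example stays deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Member:
    """Hypothetical view of one cluster member as seen by the plug-in."""
    name: str
    marked_down_at: Optional[int] = None  # second of the last failure, None if up


class RetryTracker:
    """Illustrative model of the RetryInterval countdown."""

    def __init__(self, retry_interval: int = 60) -> None:
        self.retry_interval = retry_interval

    def seconds_until_retry(self, member: Member, now: int) -> int:
        # A member that is not marked down can be used immediately.
        if member.marked_down_at is None:
            return 0
        return max(0, member.marked_down_at + self.retry_interval - now)

    def is_routable(self, member: Member, now: int) -> bool:
        return self.seconds_until_retry(member, now) == 0

    def record_result(self, member: Member, now: int, succeeded: bool) -> None:
        # A failure (or timeout) restarts the full RetryInterval.
        member.marked_down_at = None if succeeded else now


tracker = RetryTracker(retry_interval=600)
member = Member("WebHAbbMember4")
tracker.record_result(member, now=0, succeeded=False)
print(tracker.seconds_until_retry(member, now=2))    # 598
print(tracker.is_routable(member, now=600))          # True
tracker.record_result(member, now=600, succeeded=False)
print(tracker.seconds_until_retry(member, now=601))  # 599
```

With a RetryInterval of 600, the countdown two seconds after the failure matches the "retry in 598" trace line in Example 9-3, and a failed retry at second 600 starts a new 600-second interval.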
The proper setting for the RetryInterval depends on your environment, particularly the operating system TCP timeout value and the number of application servers configured in the cluster. Setting the RetryInterval to a small value allows an application server that becomes available again to quickly begin serving requests. However, too small a value can cause serious performance degradation, or even cause your plug-in to appear to stop serving requests, particularly in a machine outage situation.
To explain how this can happen, let's look at an example configuration with two machines, which we will call A and B. Each of these machines is running two clustered appservers (CM1 and CM2 on A, CM3 and CM4 on B). The HTTP server and plug-in are running on an AIX system with a TCP timeout of 75 seconds, the RetryInterval is set to 60 seconds, and the routing algorithm is weighted round robin. If machine A fails, either expectedly or unexpectedly, the following process occurs when a request comes in to the plug-in:
1. The plug-in accepts the request from the HTTP server and determines the server cluster.
2. The plug-in determines that the request should be routed to cluster member CM1 on machine A.
3. The plug-in attempts to connect to CM1 on machine A. Because the physical machine is down, the plug-in waits 75 seconds for the operating system TCP timeout interval before determining that CM1 is unavailable.
4. The plug-in attempts to route the same request to the next cluster member in its routing algorithm, CM2 on machine A. Because machine A is still down, the plug-in must again wait 75 seconds for the operating system TCP timeout interval before determining that CM2 is also unavailable.
5. The plug-in attempts to route the same request to the next cluster member in its routing algorithm, CM3 on machine B. This application server successfully returns a response to the client, over 150 seconds after the request was first submitted.
6. While the plug-in was waiting for the connection to CM2 on machine A to time out, the 60-second RetryInterval for CM1 on machine A expired, and the cluster member was added back into the routing algorithm. A new request will soon be routed to this cluster member, which is still unavailable, and this lengthy waiting process will begin again.
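The timing in the steps above can be reproduced with a small calculation. This is an illustrative sketch that assumes a single request tried against the failed members one after another, as in the scenario; the function simulate_failover is hypothetical and does not model concurrent requests.

```python
TCP_TIMEOUT = 75      # seconds, operating system TCP timeout from the scenario
RETRY_INTERVAL = 60   # seconds, RetryInterval from the scenario


def simulate_failover(failed_members, tcp_timeout, retry_interval):
    """Walk one request through the failed members in routing order.

    Returns the seconds elapsed before a healthy member can be tried, and
    the second at which each failed member becomes eligible for retry.
    """
    clock = 0
    retry_at = {}
    for name in failed_members:
        clock += tcp_timeout                     # blocked until the connect times out
        retry_at[name] = clock + retry_interval  # marked down for the RetryInterval
    return clock, retry_at


elapsed, retry_at = simulate_failover(["CM1", "CM2"], TCP_TIMEOUT, RETRY_INTERVAL)
# Members whose RetryInterval expires before the first request has even
# finished failing over to machine B.
back_early = [name for name, t in retry_at.items() if t < elapsed]

print(elapsed)     # 150
print(retry_at)    # {'CM1': 135, 'CM2': 210}
print(back_early)  # ['CM1']
```

CM1 is marked down at second 75 and becomes eligible again at second 135, which is before the request reaches CM3 at second 150. This is the overlap described in step 6.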
To avoid this problem, we recommend setting a more conservative RetryInterval, related to the number of cluster members in your configuration. A good starting point is 10 seconds + (#_of_cluster_members * TCP_Timeout). This ensures that the plug-in does not get stuck in a situation of constantly trying to route requests to the failed members. In the scenario described before, with four cluster members and a 75-second TCP timeout, this gives a RetryInterval of 10 + (4 * 75) = 310 seconds. This setting would cause the two cluster members on machine B to exclusively service requests for 235 seconds before the cluster members on machine A are retried, resulting in another 150-second wait.
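The starting-point formula and the 235-second figure can be worked through as follows. This is a sketch under the assumptions of the scenario (failed members are tried one after another, each costing one full TCP timeout); both function names are hypothetical.

```python
def recommended_retry_interval(cluster_members, tcp_timeout, margin=10):
    """Starting point from the text: 10 seconds + (members * TCP timeout)."""
    return margin + cluster_members * tcp_timeout


def exclusive_service_window(retry_interval, failed_members, tcp_timeout):
    """Seconds during which only the healthy members receive requests.

    The window opens when the last failed member is marked down and closes
    when the first failed member becomes eligible for retry.
    """
    first_retry_at = tcp_timeout + retry_interval    # first member marked down after one timeout
    last_marked_down_at = failed_members * tcp_timeout
    return first_retry_at - last_marked_down_at


interval = recommended_retry_interval(cluster_members=4, tcp_timeout=75)
window = exclusive_service_window(interval, failed_members=2, tcp_timeout=75)

print(interval)  # 310
print(window)    # 235
```

The first failed member is retried at second 385 (75 + 310) and the second is marked down at second 150, which leaves the 235 seconds during which machine B alone serves requests.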
As mentioned earlier, another option is to configure your application servers to use a non-blocking connection. This eliminates the impact of the operating system TCP/IP timeout. See "ConnectTimeout setting" for how to configure this option.
