2.5.2 Web container failures

In a clustered environment with several cluster members, an unavailable appserver does not mean an interruption of the service. When the plug-in has selected a cluster member to handle a request, it attempts to communicate with that cluster member. There are, however, a number of situations in which the plug-in might not be able to complete a request to a specific appserver. If this communication is unsuccessful or breaks, the plug-in marks the cluster member as down and attempts to find another cluster member to handle the request. Web container failures are detected based on TCP response values, or lack of response, to a plug-in request.

Marking the cluster member as down means that, should that cluster member be chosen by the workload management policy or through session affinity, the plug-in will not try to connect to it. The plug-in knows that the member is marked as down and ignores it.

The following are some example scenarios when the plug-in cannot connect to a cluster member:

- Expected appserver failures (the cluster member has been brought down intentionally for maintenance, for example).

- Unexpected server process failures (the appserver JVM has crashed, for example).

- Server network problems between the plug-in and the cluster member (a router is broken, for example).

- System problems (whether expected or not), such as system shutdown or power failures.

- The cluster member is overloaded and cannot process the request (for example because the system is too small to handle a large number of clients, or because the server weight is inappropriate).

In the first two failure cases described, the physical machine where the Web container is supposed to be running is still available, although the WebContainer Inbound Chain is not available. When the plug-in attempts to connect to the WebContainer Inbound Chain to process a request for a Web resource, the machine will refuse the connection, causing the plug-in to mark the appserver as down.

In the third and fourth events, however, the physical machine is no longer available to provide any kind of response. In these events, if non-blocking connection is not enabled, the plug-in waits for the local operating system to time out the request before marking the appserver unavailable. While the plug-in is waiting for this connection to time out, requests routed to the failed appserver appear to hang. The default value for the TCP timeout varies based on the operating system. While these values can be modified at the operating system level, adjustments should be made with great care: modifications might result in unintended consequences in both WebSphere and other network-dependent applications running on the machine. This problem can be eliminated by enabling non-blocking connection. Refer to Connection Timeout setting for more information.
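The non-blocking connection behavior mentioned above is typically enabled through the ConnectTimeout attribute of the Server element in the plug-in configuration file (plugin-cfg.xml). The fragment below is only a minimal sketch; the cluster name, server names, host names, ports, and the five-second timeout are illustrative values, not recommendations:

   <ServerCluster Name="MyCluster">
      <!-- A ConnectTimeout greater than 0 makes the plug-in use a non-blocking
           connect: it waits at most this many seconds for the connection to be
           established before marking the cluster member as down, instead of
           waiting for the operating system TCP timeout. -->
      <Server Name="Member1" ConnectTimeout="5">
         <Transport Hostname="host1.example.com" Port="9080" Protocol="http"/>
      </Server>
      <Server Name="Member2" ConnectTimeout="5">
         <Transport Hostname="host2.example.com" Port="9080" Protocol="http"/>
      </Server>
   </ServerCluster>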

In the fifth case, overloading can make a healthy server unavailable. To avoid overloading of servers, you can define the maximum number of connections that are allowed from HTTP servers to the appserver. This is explained in Maximum number of connections.
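As a sketch of how that limit appears in plugin-cfg.xml, the MaxConnections attribute on the Server element caps the number of concurrent connections the plug-in opens to a cluster member; the names and the value of 50 below are purely illustrative:

   <ServerCluster Name="MyCluster">
      <!-- With MaxConnections set, the plug-in stops sending new requests to
           this member once 50 connections are pending against it and tries
           another available cluster member instead. -->
      <Server Name="Member1" MaxConnections="50">
         <Transport Hostname="host1.example.com" Port="9080" Protocol="http"/>
      </Server>
      <Server Name="Member2" MaxConnections="50">
         <Transport Hostname="host2.example.com" Port="9080" Protocol="http"/>
      </Server>
   </ServerCluster>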

