6.3.2 Web container failures

In a clustered environment with several cluster members, an unavailable application server does not mean an interruption of service. When the plug-in has selected a cluster member to handle a request, it attempts to communicate with that cluster member. There are, however, a number of situations in which the plug-in might not be able to complete a request to a specific application server. If the communication is unsuccessful or breaks, the plug-in marks the cluster member as down and attempts to find another cluster member to handle the request. Web container failures are detected based on TCP response values or on the lack of a response to a plug-in request.
Marking a cluster member as down means that, should that cluster member be chosen by the workload management policy or through session affinity, the plug-in will not try to connect to it. The plug-in knows that the member is marked as down and ignores it.
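The mark-down and failover behavior described above can be illustrated with a minimal sketch. This is not the plug-in's actual code: the `ClusterMember` class, the `route_request` function, and the `send` callback are hypothetical names introduced only for illustration, and the sketch assumes that any failed communication surfaces as an `OSError`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ClusterMember:
    # Hypothetical stand-in for one application server in the cluster.
    name: str
    marked_down: bool = False


def route_request(
    members: Iterable[ClusterMember],
    send: Callable[[ClusterMember], str],
) -> Optional[str]:
    """Try members in order, skipping any that are already marked down."""
    for member in members:
        if member.marked_down:
            # Known to be down: do not even attempt a connection.
            continue
        try:
            return send(member)
        except OSError:
            # Refused, reset, or timed out: mark the member down
            # and fall through to the next candidate.
            member.marked_down = True
    return None  # no cluster member could handle the request
```

Once a member has been marked down, later calls skip it without contacting it, which is the behavior the text attributes to the plug-in.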
Some example scenarios in which the plug-in cannot connect to a cluster member are:

Expected application server failures (The cluster member has been brought down intentionally for maintenance, for example.)
Unexpected server process failures (The application server JVM has crashed, for example.)
Server network problems between the plug-in and the cluster member (A router is broken, for example.)
System problems (whether expected or not), such as system shutdown or power failures
The cluster member is overloaded and cannot process the request (For example, because the system is too small to handle a large number of clients, or because the server weight is inappropriate.)

In the first two failure cases described, the physical machine where the Web container is supposed to be running is still available, although the WebContainer Inbound Chain is not available. When the plug-in attempts to connect to the WebContainer Inbound Chain to process a request for a Web resource, the machine will refuse the connection, causing the plug-in to mark the application server as down.
In the third and fourth cases, however, the physical machine is no longer available to provide any kind of response. In these cases, if non-blocking connection is not enabled, the plug-in waits for the local operating system to time out the request before marking the application server unavailable. While the plug-in is waiting for this connection to time out, requests routed to the failed application server appear to hang. The default value for the TCP timeout varies based on the operating system. While these values can be modified at the operating system level, adjustments should be made with great care. Modifications might result in unintended consequences in both WebSphere and other network-dependent applications running on the machine. This problem can be eliminated by enabling non-blocking connection. Refer to Connection timeout for more information.
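The difference between a refused connection and an unanswered one can be sketched with an ordinary socket and an application-level timeout. This is only an illustration of the idea, not how the plug-in is implemented: the `probe` function and its `timeout_s` parameter are hypothetical, and the return labels are chosen for this example.

```python
import socket


def probe(host: str, port: int, timeout_s: float) -> str:
    """Classify one connection attempt as 'up', 'refused', 'timed out', or 'error'."""
    try:
        # The explicit timeout bounds the wait instead of relying on
        # the operating system's default TCP timeout.
        with socket.create_connection((host, port), timeout=timeout_s):
            return "up"
    except ConnectionRefusedError:
        # The machine answered, but nothing is listening on the port
        # (the first two failure cases).
        return "refused"
    except socket.timeout:
        # No answer at all within timeout_s (the third and fourth cases).
        return "timed out"
    except OSError:
        # Any other network-level failure, such as an unreachable host.
        return "error"
```

With a bounded timeout, a caller learns within `timeout_s` seconds that the server is unreachable, rather than appearing to hang until the operating system gives up.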
In the fifth case, overloading can make a healthy server unavailable. To avoid overloading of servers, you can define the maximum number of connections that are allowed from HTTP servers to the application server. This is explained in Maximum number of connections.
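The idea of capping connections per application server can be sketched as follows. The `Server` class, its `max_connections` and `pending` fields, and the `pick_server` and `release` functions are hypothetical and exist only for this illustration; the actual setting and its semantics are described in Maximum number of connections.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    # Hypothetical model of one application server and its connection cap.
    name: str
    max_connections: int
    pending: int = 0  # connections currently in use


def pick_server(servers: List[Server]) -> Optional[Server]:
    """Return the first server below its cap and count the new connection."""
    for server in servers:
        if server.pending < server.max_connections:
            server.pending += 1
            return server
    return None  # every server is at its limit


def release(server: Server) -> None:
    """Give back one connection when a request completes."""
    server.pending = max(0, server.pending - 1)
```

In this sketch a server that has reached its cap is skipped rather than marked down, so it becomes eligible again as soon as one of its connections is released.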
