EJB container failover behavior and tuning

If the failure occurs on the initial request, before routing table information is available, a COMM_FAILURE exception is returned. The ORB recognizes that it has an indirect IOR available and re-sends the request to the Location Service Daemon (LSD) to determine another server to route to. If the failure occurs after the client has retrieved the routing table information, the WLM client handles the COMM_FAILURE: the server is removed from the list of selectable servers, and the routing algorithm selects a different server for the request.

Consider the following sequence, in which a client makes a request to the EJB container of an appserver:

1. For the initial client request, no server cluster or routing information is available in the WLM client's runtime process. The request is therefore directed to the LSD, which is hosted on a Node Agent, to obtain routing information. If the LSD connection fails, the request is redirected to an alternative LSD (LSD failover, see 9.4.4, LSD failover). On subsequent requests, a WLM-aware client already has routing information; for WLM-unaware clients, the LSD always routes requests to available servers. If, on a later request, the WLM client's routing information does not match the server's, new routing information is added to the response as a service context.

2. After getting the InitialContext, the client looks up the EJB's home object (an indirect IOR to the home object). If a failure occurs at this point, the WLM code transparently redirects the request to another server in the cluster that is capable of obtaining the bean's home object.

3. One or more servers become unusable during the life cycle of the request. If the request has strong affinity, it cannot fail over: the request fails if the original server becomes unavailable, and the client must perform recovery logic and resubmit the request. If the request is sent to an overloaded server, the server's unresponsiveness can make it appear stopped, which may lead to a timeout. In these circumstances it may help to change the server weight or to tune ORB and pool properties such as:
com.ibm.CORBA.RequestTimeout
com.ibm.CORBA.RequestRetriesCount
com.ibm.CORBA.RequestRetriesDelay
com.ibm.CORBA.LocateRequestTimeout

These properties can be set on the command line or changed using the Administrative Console.

A cluster member is marked unusable once the com.ibm.ejs.wlm.MaxCommFailures threshold has been reached. By default, the threshold is 0, so the appserver is marked unusable after the first failure. The property can be modified by specifying -Dcom.ibm.ejs.wlm.MaxCommFailures=<number> as a command-line argument when launching a client.

If a machine becomes unreachable (because of network or individual machine errors) before a connection to a server has been established, the operating system's TCP/IP keep-alive timeout dominates the system's response to a request, because the client waits for the OS-specific keep-alive timeout before a failure is detected. This value can be modified, but only with caution, as described in 5.7.3, Tuning failover.
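The MaxCommFailures rule described above can be illustrated with a small toy model. This is only a sketch and not IBM's WLM implementation: the class CommFailureTracker and its methods are hypothetical names, and the comparison used (a member becomes unusable when its failure count exceeds the threshold) is an assumption chosen to match the stated default, where a threshold of 0 marks the member unusable after the first failure.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model (hypothetical, not WLM code) of a per-member communication
 * failure threshold such as com.ibm.ejs.wlm.MaxCommFailures.
 */
class CommFailureTracker {
    private final int maxCommFailures;   // threshold; 0 is the default per the text
    private final Map<String, Integer> failures = new HashMap<>();

    CommFailureTracker(int maxCommFailures) {
        this.maxCommFailures = maxCommFailures;
    }

    /** Records one communication failure; returns true if the member is now unusable. */
    boolean recordFailure(String member) {
        int count = failures.merge(member, 1, Integer::sum);
        return count > maxCommFailures;   // assumed comparison, see lead-in
    }

    /** A member stays selectable while its failure count is within the threshold. */
    boolean isUsable(String member) {
        return failures.getOrDefault(member, 0) <= maxCommFailures;
    }

    public static void main(String[] args) {
        // Reads the same -D property name only to drive this toy model.
        int threshold = Integer.getInteger("com.ibm.ejs.wlm.MaxCommFailures", 0);
        CommFailureTracker tracker = new CommFailureTracker(threshold);
        boolean unusable = tracker.recordFailure("member1");
        System.out.println("member1 unusable after first failure: " + unusable);
    }
}
```

Raising the threshold in this model lets a member absorb a few transient failures before it is taken out of rotation, at the cost of sending more requests to a server that may really be down.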

If a connection to a server is already established, com.ibm.CORBA.RequestTimeout dominates (the default value is 180 seconds), and the client waits this length of time before a failure is detected. The default value should be modified only if an application repeatedly experiences timeouts, and great care must be taken to tune it properly. If the value is set too high, failover will be very slow; if it is set too low, requests will time out before the server has a chance to respond.
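Because the request timeout bounds how long failure detection takes on an established connection, it can be useful for a Java client to log the value it was launched with. The following is a minimal sketch: the class name RequestTimeoutReport is hypothetical, the property name is taken from the property list earlier in this section, and the code only sees values supplied as -D options (or set programmatically), so it may not reflect settings the ORB obtains elsewhere.

```java
/** Reports the ORB request timeout passed to this JVM as a system property. */
class RequestTimeoutReport {
    static final String PROPERTY = "com.ibm.CORBA.RequestTimeout";
    static final int DEFAULT_SECONDS = 180;   // default stated in the text

    /** Returns the configured timeout, or the default if unset or not a number. */
    static int configuredSeconds() {
        return Integer.getInteger(PROPERTY, DEFAULT_SECONDS);
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + " = " + configuredSeconds() + " seconds");
    }
}
```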

The two most critical factors affecting the choice of a timeout value are the time needed to process a request and the network latency between the client and server. The processing time in turn depends on the application and the load on the server; the network latency depends on the location of the client. For example, clients running within the same LAN as a server may use a smaller timeout value to provide faster failover. If the client is a process inside a WebSphere Application Server (for example, a servlet), this property can be modified by editing the request timeout field on the Object Request Broker property sheet. If the client is a Java client, the property can be specified as a runtime option on the Java command line, for example:

java -Dcom.ibm.CORBA.RequestTimeout=<seconds> MyClient

A failed server is marked unusable, and a JMX notification is sent. The routing table is updated, and WLM-aware clients are updated during request/response flows. Future requests are not routed to this cluster member until new cluster information is received (for example, after the server process is restarted) or until the com.ibm.ejs.wlm.unusable.interval expires. This property is set in seconds; the default value is 300. It can be set by specifying -Dcom.ibm.ejs.wlm.unusable.interval=<seconds> as a command-line argument for the client process.

When a request results in an org.omg.CORBA.COMM_FAILURE or org.omg.CORBA.NO_RESPONSE, the exception's completion status determines whether the request can be transparently redirected to another server. In the case of COMPLETED_NO, the request can be rerouted. If the completion status is COMPLETED_YES, no failover is required: the request was successful, but a communication error was encountered while marshaling the response. In the case of COMPLETED_MAYBE, WLM cannot verify whether the request completed successfully and cannot redirect it. For example, consider a transaction that must be executed "at most once": had WLM redirected the request, it might have been serviced twice. The programming model is for the client to receive this exception and to have logic in place to decide whether to retry the request.

If all servers are unavailable, the request results in an org.omg.CORBA.NO_IMPLEMENT. At this point, either the network is down or some other error has caused the entire cluster to be unreachable.
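The completion-status rules described above can be summarized as a small decision function. This is a sketch under stated assumptions: the enum Completion is a local stand-in for org.omg.CORBA.CompletionStatus so that the example compiles without a CORBA library, and Action, FailoverDecision, and the idempotent flag are hypothetical names introduced for illustration. The retry policy for COMPLETED_MAYBE is one possible client choice, not something WLM prescribes.

```java
/** Local stand-in for org.omg.CORBA.CompletionStatus (assumption, see lead-in). */
enum Completion { COMPLETED_YES, COMPLETED_NO, COMPLETED_MAYBE }

/** Hypothetical outcomes for a COMM_FAILURE or NO_RESPONSE. */
enum Action { REROUTE, NO_FAILOVER_NEEDED, CLIENT_DECIDES }

class FailoverDecision {
    /** Maps the completion status to the handling described in the text. */
    static Action onCommFailure(Completion status) {
        switch (status) {
            case COMPLETED_NO:
                return Action.REROUTE;             // safe to redirect transparently
            case COMPLETED_YES:
                return Action.NO_FAILOVER_NEEDED;  // work was done; only the reply was lost
            default:
                return Action.CLIENT_DECIDES;      // COMPLETED_MAYBE: outcome unknown
        }
    }

    /**
     * Example client policy for COMPLETED_MAYBE: retry only operations the
     * application knows are safe to run twice. The flag is hypothetical.
     */
    static boolean clientShouldRetry(Completion status, boolean idempotent) {
        return status == Completion.COMPLETED_MAYBE && idempotent;
    }

    public static void main(String[] args) {
        for (Completion c : Completion.values()) {
            System.out.println(c + " -> " + onCommFailure(c));
        }
    }
}
```

In a real client the status would come from the caught system exception rather than from a local enum, and an "at most once" operation would pass idempotent = false so that the client never resubmits it blindly.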

Note that, as with the Web container discussed earlier, the appservers on a node are forced to stop when the network is down if the loopback is not configured with the alias of a host IP.


 
