

Configure failover detection


You can configure the amount of time between system checks for failed servers with the heartbeat interval setting.

How you configure failover depends on the type of environment that you are using. In a stand-alone environment, configure failover from the command line. In a WebSphere Application Server Network Deployment (WAS ND) environment, configure failover in the WAS ND administrative console.


Configure failover for stand-alone environments

You can configure heartbeat intervals on the command line by using the -heartbeat parameter in the startOgServer script file. Set this parameter to one of the following values:

Value  Action             Description
0      Typical (default)  Failovers are typically detected within 30 seconds.
-1     Aggressive         Failovers are typically detected within 5 seconds.
1      Relaxed            Failovers are typically detected within 180 seconds.

An aggressive heartbeat interval can be useful when the processes and network are stable. If the network or processes are not optimally configured, heartbeats might be missed, which can result in a false failure detection.
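
As a minimal sketch of how the values above might be used from a launcher script, the following Python snippet validates a -heartbeat value and builds the argument list. The script file name startOgServer.sh, the server name cs1, and the position of the server name before the options are assumptions made for illustration, not details taken from this guide.

```python
# Values from the table above: -heartbeat value -> (name, typical detection time in seconds).
HEARTBEAT_LEVELS = {
    0: ("Typical", 30),
    -1: ("Aggressive", 5),
    1: ("Relaxed", 180),
}


def heartbeat_args(value):
    """Return the -heartbeat command-line fragment, rejecting values not in the table."""
    if value not in HEARTBEAT_LEVELS:
        raise ValueError(f"unsupported -heartbeat value: {value}")
    return ["-heartbeat", str(value)]


def build_command(server_name, value, script="startOgServer.sh"):
    """Build an argument list; script and server_name are hypothetical placeholders."""
    return [script, server_name, *heartbeat_args(value)]


if __name__ == "__main__":
    name, seconds = HEARTBEAT_LEVELS[-1]
    print(build_command("cs1", -1), f"{name}: about {seconds} s")
```

Rejecting unknown values before the command is assembled keeps a typo from silently falling back to some other detection window.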


Configure failover for WebSphere Application Server environments

You can configure WAS ND v6.0.2 and later to allow WebSphere eXtreme Scale to fail over very quickly. The default failover time for hard failures is approximately 200 seconds. A hard failure is a physical computer or server crash, a network cable disconnect, or an operating system error. Process crashes and other soft failures typically fail over in less than one second. A soft failure is detected when the operating system of the server that hosts the process automatically closes the network sockets of the dead process.


Core group heartbeat configuration

WebSphere eXtreme Scale running in a WebSphere Application Server process inherits the failover characteristics from the core group settings of the application server. The following sections describe how to configure the core group heartbeat settings for different versions of WAS ND:


Update the core group settings for WAS ND v6.x and 7.x:

Specify the heartbeat interval in seconds on WebSphere Application Server versions from v6.0 through v6.1.0.12, or in milliseconds starting with v6.1.0.13. You must also specify the number of missed heartbeats. This value indicates how many heartbeats can be missed before a peer JVM is considered failed.

The hard failure detection time is approximately the product of the heartbeat interval and the number of missed heartbeats.
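
The product rule can be checked with a short calculation. The sketch below uses the default values that this section states for the seconds-based properties (an interval of 20 and 10 missed heartbeats); the function name is hypothetical.

```python
def hard_failure_detection_time(heartbeat_interval, missed_heartbeats):
    """Approximate detection time, in the same unit as heartbeat_interval."""
    if heartbeat_interval <= 0 or missed_heartbeats <= 0:
        raise ValueError("both values must be positive")
    return heartbeat_interval * missed_heartbeats


if __name__ == "__main__":
    # Defaults from this section: 20-second interval, 10 missed heartbeats.
    print(hard_failure_detection_time(20, 10))  # 200
```

The result of 200 seconds agrees with the default hard failure failover time quoted earlier.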

Specify these values as custom properties on the core group in the WebSphere administrative console.

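These properties must be specified for all core groups used by the application:

IBM_CS_FD_PERIOD_SEC The heartbeat interval, in seconds.
IBM_CS_FD_PERIOD_MILLIS The heartbeat interval, in milliseconds (v6.1.0.13 and later).
IBM_CS_FD_CONSECUTIVE_MISSED The number of missed heartbeats.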

The default value for the IBM_CS_FD_PERIOD_SEC property is 20 and for the IBM_CS_FD_CONSECUTIVE_MISSED property is 10.

If the IBM_CS_FD_PERIOD_MILLIS property is specified, it overrides any IBM_CS_FD_PERIOD_SEC custom property that is set.

Use the following settings to achieve a 1500 ms failure detection time for WAS ND v6.x servers:
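
As a hedged illustration, one combination consistent with the product rule can be verified as follows. The values 750 and 2 are assumptions chosen so that the product is 1500; they are not reproduced from this guide. The millisecond property applies only to v6.1.0.13 and later.

```python
TARGET_MILLIS = 1500


def detection_time_millis(period_millis, consecutive_missed):
    """Approximate hard failure detection time: interval times missed heartbeats."""
    return period_millis * consecutive_missed


# Hypothetical candidate values, keyed by the custom property names from this section.
candidate = {
    "IBM_CS_FD_PERIOD_MILLIS": 750,
    "IBM_CS_FD_CONSECUTIVE_MISSED": 2,
}

result = detection_time_millis(
    candidate["IBM_CS_FD_PERIOD_MILLIS"],
    candidate["IBM_CS_FD_CONSECUTIVE_MISSED"],
)

if __name__ == "__main__":
    print(result, result == TARGET_MILLIS)  # 1500 True
```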


Update the core group settings for WAS ND v7.0

WAS ND v7.0 provides two core group settings that can be adjusted to increase or decrease the failover detection time:

Heartbeat transmission period The default is 30000 milliseconds.
Heartbeat timeout period The default is 180000 milliseconds.

For more details on how to change these settings, see the WAS ND Information Center: Discovery and failure detection settings.

Use the following settings to achieve a 1500 ms failure detection time for WAS ND v7 servers:
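
The relationship between the two v7.0 settings can be sketched with a small check. This sketch assumes that the heartbeat timeout period is the effective hard failure detection time and that the transmission period must be shorter than the timeout; the candidate values 750 and 1500 are illustrative assumptions, not settings taken from this guide.

```python
def heartbeats_per_timeout(transmission_millis, timeout_millis):
    """Return how many transmission periods fit into one timeout period."""
    if transmission_millis <= 0 or timeout_millis <= 0:
        raise ValueError("both periods must be positive")
    if transmission_millis >= timeout_millis:
        raise ValueError("transmission period should be shorter than the timeout")
    return timeout_millis / transmission_millis


if __name__ == "__main__":
    # Defaults from this section: 30000 ms transmission, 180000 ms timeout.
    print(heartbeats_per_timeout(30000, 180000))  # 6.0
    # Hypothetical candidate for a 1500 ms detection time.
    print(heartbeats_per_timeout(750, 1500))  # 2.0
```

Leaving room for at least two transmissions within the timeout means that a single late heartbeat does not by itself trigger a failure detection.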


What to do next

When these settings are modified to provide short failover times, there are some system-tuning issues to be aware of. First, Java is not a real-time environment. Threads can be delayed if the JVM experiences long garbage collection pauses, or if the machine hosting the JVM is heavily loaded, whether by the JVM itself or by other processes running on the machine. If threads are delayed, heartbeats might not be sent on time. In the worst case, heartbeats are delayed by as long as the configured failover time, and a false failure detection occurs. The system must be tuned and sized so that false failure detections do not happen in production. Adequate load testing is the best way to ensure this.

The current version of eXtreme Scale supports WebSphere Real Time.


