Configure failover detection
You can configure the amount of time between system checks for failed servers with the heartbeat interval setting.
How you configure failover detection depends on the type of environment you are using. In a stand-alone environment, configure failover on the command line. In a WebSphere Application Server Network Deployment (WAS ND) environment, configure failover in the WAS ND administrative console.
- Configure failover for stand-alone environments.
You can configure heartbeat intervals on the command line by using the -heartbeat parameter in the startOgServer script file. Set this parameter to one of the following values:
Table 1. Heartbeat intervals

| Value | Action | Description |
|---|---|---|
| 0 | Typical (default) | Failovers are typically detected within 30 seconds. |
| -1 | Aggressive | Failovers are typically detected within 5 seconds. |
| 1 | Relaxed | Failovers are typically detected within 180 seconds. |
An aggressive heartbeat interval can be useful when the processes and network are stable. If the network or processes are not optimally configured, heartbeats might be missed, which can result in a false failure detection.
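As a minimal sketch of the values in Table 1, the following Python helper validates a heartbeat value and builds the matching -heartbeat arguments. The helper names (heartbeat_args, expected_detection_seconds) are hypothetical and exist only for illustration. Only the -heartbeat parameter and its three values come from the text above; any other startOgServer arguments your environment needs are outside this sketch.

```python
# Typical detection times, in seconds, taken from Table 1.
HEARTBEAT_MODES = {
    0: ("Typical", 30),
    -1: ("Aggressive", 5),
    1: ("Relaxed", 180),
}


def heartbeat_args(value):
    """Return the -heartbeat arguments for startOgServer (hypothetical helper)."""
    if value not in HEARTBEAT_MODES:
        raise ValueError("heartbeat must be 0, -1, or 1, got %r" % (value,))
    return ["-heartbeat", str(value)]


def expected_detection_seconds(value):
    """Return the typical failover detection time for a heartbeat value."""
    if value not in HEARTBEAT_MODES:
        raise ValueError("heartbeat must be 0, -1, or 1, got %r" % (value,))
    return HEARTBEAT_MODES[value][1]


if __name__ == "__main__":
    for value, (name, seconds) in sorted(HEARTBEAT_MODES.items()):
        print(name, " ".join(heartbeat_args(value)), "->", seconds, "seconds")
```

Rejecting any value other than 0, -1, or 1 up front keeps a typo from being passed silently to the script.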
- Configure failover for WAS environments.
You can configure WAS ND v6.0.2 and later to allow WebSphere eXtreme Scale to fail over very quickly. The default failover time for hard failures is approximately 200 seconds. A hard failure is a physical computer or server crash, a network cable disconnect, or an operating system error. Soft failures, such as process crashes, typically fail over in less than one second. Soft failures are detected when the operating system of the server that hosts the process automatically closes the network sockets of the dead process.
Core group heartbeat configuration
WebSphere eXtreme Scale running in a WAS process inherits the failover characteristics from the core group settings of the application server. The following sections describe how to configure the core group heartbeat settings for different versions of WAS ND:
- Update the core group settings for WAS ND v6.x and 7.x:
Specify the heartbeat interval in seconds on WAS versions from v6.0 through v184.108.40.206 or in milliseconds starting with v220.127.116.11. You must also specify the number of missed heartbeats. This value indicates how many heartbeats can be missed before a peer Java™ virtual machine (JVM) is considered as failed. The hard failure detection time is approximately the product of the heartbeat interval and the number of missed heartbeats.
Specify these settings as custom properties on the core group in the WebSphere administrative console. See Core group custom properties for configuration details. You must specify these properties for all core groups used by the application:
- The heartbeat interval is specified using either the IBM_CS_FD_PERIOD_SEC custom property for seconds or the IBM_CS_FD_PERIOD_MILLIS custom property for milliseconds (requires V18.104.22.168 or later).
- The number of missed heartbeats is specified using the IBM_CS_FD_CONSECUTIVE_MISSED custom property.
The default value of the IBM_CS_FD_PERIOD_SEC property is 20, and the default value of the IBM_CS_FD_CONSECUTIVE_MISSED property is 10. Together these defaults give the hard failure detection time of approximately 200 seconds (20 seconds x 10 missed heartbeats). If the IBM_CS_FD_PERIOD_MILLIS property is specified, it overrides any IBM_CS_FD_PERIOD_SEC value that is set. The values of these properties must be positive integers.
Use the following settings to achieve a 1500 ms failure detection time for WAS ND v6.x servers:
- Set IBM_CS_FD_PERIOD_MILLIS = 750 (WAS ND V22.214.171.124 and later)
- Set IBM_CS_FD_CONSECUTIVE_MISSED = 2
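The arithmetic behind these settings can be sketched in Python as follows. The function detection_time_ms is a hypothetical helper, not part of any WebSphere API. It applies the rules described above: the detection time is approximately the interval multiplied by the number of missed heartbeats, the defaults are 20 seconds and 10, and IBM_CS_FD_PERIOD_MILLIS takes precedence over IBM_CS_FD_PERIOD_SEC.

```python
DEFAULT_PERIOD_SEC = 20          # default for IBM_CS_FD_PERIOD_SEC
DEFAULT_CONSECUTIVE_MISSED = 10  # default for IBM_CS_FD_CONSECUTIVE_MISSED


def detection_time_ms(props):
    """Approximate hard failure detection time, in milliseconds.

    props is a plain dict of core group custom property names to values.
    """
    missed = int(props.get("IBM_CS_FD_CONSECUTIVE_MISSED",
                           DEFAULT_CONSECUTIVE_MISSED))
    if "IBM_CS_FD_PERIOD_MILLIS" in props:
        # The millisecond property overrides the seconds property.
        period_ms = int(props["IBM_CS_FD_PERIOD_MILLIS"])
    else:
        period_ms = int(props.get("IBM_CS_FD_PERIOD_SEC",
                                  DEFAULT_PERIOD_SEC)) * 1000
    if period_ms <= 0 or missed <= 0:
        raise ValueError("property values must be positive integers")
    return period_ms * missed


if __name__ == "__main__":
    print(detection_time_ms({}))
    print(detection_time_ms({"IBM_CS_FD_PERIOD_MILLIS": 750,
                             "IBM_CS_FD_CONSECUTIVE_MISSED": 2}))
```

With no properties set, the sketch yields the default detection time of about 200 seconds. With the two settings listed above, it yields 1500 ms.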
- Update the core group settings for WAS ND v7.0:
WAS ND v7.0 provides two core group settings that you can adjust to lengthen or shorten the failover detection time:
- Heartbeat transmission period. The default is 30000 milliseconds.
- Heartbeat timeout period. The default is 180000 milliseconds.
For more details on how to change these settings, see the WAS ND Information center: Discovery and failure detection settings.
Use the following settings to achieve a 1500 ms failure detection time for WAS ND v7 servers:
- Set the heartbeat transmission period to 750 milliseconds.
- Set the heartbeat timeout period to 1500 milliseconds.
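To relate the v7.0 settings to the interval-and-missed-count model used for v6.x, the sketch below computes how many transmission periods fit within the timeout period. The function name heartbeats_per_timeout is hypothetical, and reading this ratio as a number of missed heartbeats is an assumption made for comparison only, not documented product behavior.

```python
DEFAULT_TRANSMISSION_MS = 30000  # default heartbeat transmission period
DEFAULT_TIMEOUT_MS = 180000      # default heartbeat timeout period


def heartbeats_per_timeout(transmission_ms=DEFAULT_TRANSMISSION_MS,
                           timeout_ms=DEFAULT_TIMEOUT_MS):
    """Return how many transmission periods fit in the timeout period."""
    if transmission_ms <= 0 or timeout_ms <= 0:
        raise ValueError("periods must be positive")
    return timeout_ms / transmission_ms


if __name__ == "__main__":
    print(heartbeats_per_timeout())           # default settings
    print(heartbeats_per_timeout(750, 1500))  # settings listed above
```

Under this reading, the settings listed above leave room for two transmission periods before the timeout expires, which is the same ratio as the v6.x example.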
What to do next
When you modify these settings to provide short failover times, be aware of some system-tuning issues. First, Java is not a real-time environment. Threads can be delayed if the JVM is experiencing long garbage collection times. Threads might also be delayed if the machine hosting the JVM is heavily loaded, whether by the JVM itself or by other processes running on the machine. If threads are delayed, heartbeats might not be sent on time. In the worst case, they might be delayed by as long as the required failover time, and false failure detections can then occur. Tune and size the system so that false failure detections do not happen in production. Adequate load testing is the best way to ensure this.
The current version of eXtreme Scale supports WebSphere Real Time.
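The thread-delay concern described above can be expressed as a rough check, sketched here in Python. Both function names are hypothetical, and the comparison is a simplified heuristic assumed for illustration: it flags a configuration when the longest pause observed during load testing (for example, from garbage collection logs) is as long as the configured failure detection time. It does not replace load testing.

```python
def pause_margin_ms(detection_ms, worst_pause_ms):
    """Return the headroom between the detection time and the worst pause."""
    return detection_ms - worst_pause_ms


def risks_false_detection(detection_ms, worst_pause_ms):
    """Return True if a pause could delay heartbeats for the whole window.

    Assumption: a pause at least as long as the detection time can cause
    peers to miss every heartbeat in that window.
    """
    if detection_ms <= 0 or worst_pause_ms < 0:
        raise ValueError("detection_ms must be positive, worst_pause_ms non-negative")
    return worst_pause_ms >= detection_ms


if __name__ == "__main__":
    # Hypothetical measurements: 1500 ms detection time, 200 ms worst pause.
    print(risks_false_detection(1500, 200), pause_margin_ms(1500, 200))
```

A small or negative margin suggests either relaxing the detection time or tuning the JVM and machine load before using the settings in production.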