6.5.1 Active failure detection

A JVM is marked as failed if its heartbeat signals to its core group peers are lost for a specified interval. The DCS sends heartbeats between every JVM pair in a view. With the default settings, heartbeats are sent every 10 seconds and 20 heartbeat signals must be lost before a JVM is raised as a suspect and a failover is initiated. The default failure detection time is therefore 200 seconds.
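The arithmetic behind the 200-second figure can be sketched as follows. This is only an illustration: the function name `failure_detection_time` is hypothetical and is not part of any WebSphere API. The second call uses the V6.0.2 defaults mentioned in the note later in this section.

```python
def failure_detection_time(period_secs: int, consecutive_missed: int) -> int:
    """Approximate seconds of lost heartbeats before a JVM is raised as a suspect.

    Hypothetical helper for illustration only; it multiplies the heartbeat
    interval by the number of consecutive heartbeats that must be lost.
    """
    return period_secs * consecutive_missed


# Defaults described above: a heartbeat every 10 seconds, 20 lost heartbeats.
default_detection = failure_detection_time(10, 20)

# V6.0.2 defaults: a heartbeat every 30 seconds, six lost heartbeats.
v602_detection = failure_detection_time(30, 6)

print(default_detection)  # 200
print(v602_detection)     # 180
```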

This detection time is very high and should be lowered by most customers in a production environment. A failure detection time of 10 to 30 seconds is normally recommended for a well-tuned cell.

Contact IBM support or services if there is a need to tune these settings below the recommended range.

When a JVM failure is detected, the other members of the view mark that JVM as a suspect. This can be seen in the SystemOut.log shown in Example 6-6. In this case the new view is installed quickly in order to achieve fast recovery. View installations triggered by JVM starts are deliberately slower. Otherwise, there would be frequent view installations when several JVMs are started together.

Heartbeat delivery can be delayed due to a number of commonly seen system problems:

- Swapping

  When a system is swapping, the JVM could get paged out, so heartbeat signals are not sent or received in time.

- Thread scheduling thrashing

  Java is not a real-time environment. When many runnable threads accumulate in a system, each thread suffers a long delay before it is scheduled. The threads of a JVM might therefore not be scheduled in time to process heartbeat signals. This thread scheduling problem also affects the applications on that system, because their response times become unacceptable as well. Therefore, systems must be tuned to avoid CPU starvation and heavy paging.

Note: For WebSphere V6.0.2, the default heartbeat settings have been changed: a heartbeat is sent every 30 seconds, and six consecutive lost heartbeats denote a failure.

Any of the above problems can cause instability in your high availability environment. After tuning the system not to suffer from swapping or thread thrashing, the heartbeat interval can be lowered to increase the sensitivity of failure detection.

Use the core group custom properties listed in Table 6-4 to change the heartbeat frequency.


Table 6-4   Changing the frequency of active failure detection

Name                           Description                                                              Default value
IBM_CS_FD_PERIOD_SECS          This is the interval between heartbeats in seconds.                      10
IBM_CS_FD_CONSECUTIVE_MISSED   This is the number of missed heartbeats to mark a server as a suspect.   20
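As a rough illustration of how values for these two custom properties might be chosen for a target detection time, consider the sketch below. The property names come from Table 6-4. The helper `heartbeat_properties` and the example values (a 30-second target with a 5-second heartbeat interval) are assumptions made for illustration, not recommended settings. The sketch only computes the values and does not apply them to a core group.

```python
import math


def heartbeat_properties(target_secs: int, period_secs: int) -> dict:
    """Return core group custom property values for a target detection time.

    Hypothetical helper: picks the smallest missed-heartbeat count such that
    period_secs * count is at least target_secs.
    """
    if period_secs <= 0 or target_secs < period_secs:
        raise ValueError("target_secs must be at least period_secs, and period_secs must be positive")
    missed = math.ceil(target_secs / period_secs)
    return {
        "IBM_CS_FD_PERIOD_SECS": str(period_secs),
        "IBM_CS_FD_CONSECUTIVE_MISSED": str(missed),
    }


# Assumed example: detect a failure after about 30 seconds of lost heartbeats.
props = heartbeat_properties(target_secs=30, period_secs=5)
for name, value in props.items():
    print(f"{name}={value}")
```

A shorter heartbeat interval makes detection more sensitive but also more exposed to the swapping and thread scheduling delays described above, so values like these only make sense on a system that has already been tuned.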

Heartbeating is always enabled regardless of the message transport type for the HAManager.
