High availability manager
Overview
WebSphere Application Server includes a high availability manager (HAM) component whose main function is to eliminate single points of failure.
HAM runs within each JVM of a WAS cell and ensures that key services stay running by moving sessions between available application servers, rather than relying on dedicated sessions to a single application server, which is the strategy employed by the deployment manager. HAM provides peer-to-peer failover for critical services.
HAM takes advantage of fault-tolerant storage technologies, such as network-attached storage (NAS).
HAM is responsible for managing the availability of singletons within the cell. Examples of singletons include:
- Transaction managers for cluster members.
- Messaging engines.
- Workload manager (WLM) controllers.
These controllers are responsible for gathering the endpoints for applications deployed in a cluster and aggregating that information into a single routing table for that cluster.
- WLM routing information.
When a cluster member is hosting a resource that can be clustered, such as an application, a transaction manager or a messaging engine, this information needs to be shared between all of the JVMs in the cell.
- Application servers.
- WAS partitioning facility instances.
HAM continually monitors the application server environment. If an application server component fails, the HAM takes over the in-flight and in-doubt work for the failed server. This action significantly improves application server availability.
In a highly available environment, all single points of failure are eliminated. Because the HAM function is dynamic, any configuration changes that you make and save while an application server is running are eventually picked up and used. You do not have to restart an application server to enable a change. For example, if you change a policy for a messaging engine high availability group while the messaging engine is running, the new policy is dynamically loaded and applied, and the behavior of the messaging engine reflects this change.
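As an illustration only, the following wsadmin (Jython) sketch shows how such a policy change might be made while the cell is running. The policy name 'MEPolicy' is a hypothetical placeholder, and the sketch assumes a One of N policy and the standard AdminConfig scripting object:

    # Minimal wsadmin (Jython) sketch; 'MEPolicy' is a hypothetical policy name.
    for policy in AdminConfig.list('OneOfNPolicy').splitlines():
        if AdminConfig.showAttribute(policy, 'name') == 'MEPolicy':
            # Enable failback; the HAM picks up the saved change dynamically,
            # with no application server restart required.
            AdminConfig.modify(policy, [['failback', 'true']])
    AdminConfig.save()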
To provide this focused failover service, the HAM supervises the JVMs of the application servers that are core group members. The HAM uses one of the following methods to detect failures:
- An application server is marked as failed if the socket fails.
This method uses the KEEP_ALIVE function of TCP/IP and is very tolerant of extreme application server loading, which might occur if the application server is swapping or thrashing heavily. This method is recommended for determining a JVM failure if you are using multicast emulation and are running enough JVMs on a single machine to push that machine into extreme CPU starvation or memory starvation.
- A JVM is marked as failed if it stops sending heartbeats for a specified time interval.
This method is referred to as active failure detection. When it is used, a JVM sends out one heartbeat, or pulse, every ten seconds. If the JVM is unresponsive for more than 200 seconds, it is considered down.
Active heartbeating always occurs, regardless of which transport the HAM is using. You can change the values specified for the following core group custom properties to reduce the amount of heartbeating that occurs:
- IBM_CS_FD_PERIOD_SECS: The number of seconds between heartbeats.
- IBM_CS_FD_CONSECUTIVE_MISSED: The number of consecutive heartbeats that must be missed before a peer is marked as a suspect.
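As a sketch only, these properties can be set on a core group as custom properties through wsadmin (Jython); the cell name 'myCell' is a placeholder, the values shown are examples, and the exact steps should be verified against your WAS version:

    # Minimal wsadmin (Jython) sketch; 'myCell' is a hypothetical cell name.
    cg = AdminConfig.getid('/Cell:myCell/CoreGroup:DefaultCoreGroup/')
    # Example: send one heartbeat every 30 seconds ...
    AdminConfig.create('Property', cg,
                       [['name', 'IBM_CS_FD_PERIOD_SECS'], ['value', '30']],
                       'customProperties')
    # ... and suspect a peer after 6 consecutive missed heartbeats.
    AdminConfig.create('Property', cg,
                       [['name', 'IBM_CS_FD_CONSECUTIVE_MISSED'], ['value', '6']],
                       'customProperties')
    AdminConfig.save()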
In either case, if a JVM fails, the application server on which it is running is separated from the core group and any services running on that application server are failed over to the surviving core group members.
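For intuition, the plain Python sketch below models the active failure detection rule described above: a peer is suspected once the heartbeat period multiplied by the allowed number of consecutive misses elapses without a pulse. This is a conceptual model, not WAS code:

    import time

    PERIOD_SECS = 10          # seconds between expected heartbeats
    CONSECUTIVE_MISSED = 20   # misses tolerated before suspecting a peer

    class FailureDetector:
        """Suspects a peer after PERIOD_SECS * CONSECUTIVE_MISSED seconds
        (200 seconds with these values) pass without a heartbeat."""

        def __init__(self):
            self.last_heartbeat = {}

        def heartbeat(self, peer):
            # Record the time of the most recent pulse from this peer.
            self.last_heartbeat[peer] = time.time()

        def suspects(self):
            deadline = PERIOD_SECS * CONSECUTIVE_MISSED
            now = time.time()
            return [peer for peer, seen in self.last_heartbeat.items()
                    if now - seen > deadline]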
A JVM can be a node agent, an application server, or a deployment manager. If a JVM fails, any singletons running in that JVM are restarted on a peer JVM after the failure is detected. Because this peer JVM is already running, the normal startup time, which can potentially be minutes, is eliminated.
All of the application servers in a cell are defined as members of a core group. Each core group has only one logical HAM that services all of the members of that core group. The HAM is responsible for making the services within a core group highly available and scalable. It continually polls all of the core group members to verify that they are active and healthy.
A policy matching program is used to localize certain policy-driven components and to place these components into high availability groups. When a core group member fails, the high availability manager assigns the failing member's work to a component of the same type from the same high availability group. Using NAS devices as common logging facilities helps to recover in-doubt and in-flight work if a component fails.
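Conceptually, the match step can be pictured as name-value subset matching: a policy applies to a high availability group when every pair in the policy's match criteria appears in the group's name, and the policy that matches the most pairs wins. The plain Python sketch below models that idea; the group properties and policy names are hypothetical:

    def best_policy(group_name, policies):
        # group_name: dict of name-value pairs identifying an HA group.
        # policies: list of (policy, criteria) pairs; criteria is a dict
        # that must be a subset of group_name for the policy to apply.
        best, best_score = None, 0
        for policy, criteria in policies:
            if all(group_name.get(k) == v for k, v in criteria.items()):
                if len(criteria) > best_score:
                    best, best_score = policy, len(criteria)
        return best

    # Hypothetical example: the more specific policy wins.
    group = {'type': 'WSAF_SIB', 'WSAF_SIB_BUS': 'myBus'}
    print(best_policy(group, [
        ('Default messaging policy', {'type': 'WSAF_SIB'}),
        ('myBus policy', {'type': 'WSAF_SIB', 'WSAF_SIB_BUS': 'myBus'}),
    ]))  # prints: myBus policy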
WAS provides a default core group that is created during installation. New server instances are added to the default core group as they are created. The WAS environment can support multiple core groups, but one core group is usually sufficient for most environments.
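As a quick check, the core groups defined in a cell can be listed from wsadmin (Jython); on a new installation this sketch typically shows only DefaultCoreGroup:

    # List every CoreGroup configuration object in the cell.
    print(AdminConfig.list('CoreGroup'))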
HAM manages a variety of components. All of the components in a HAM infrastructure work together to ensure that peer-to-peer failover effectively protects the application server environment from failures. The following table describes the main high availability components, or areas of components, required for an effective high availability manager environment:
Server component areas
    The focus is on the application server run time, which includes such entities as cells and clusters. These areas are necessary for a healthy high availability manager run time because they closely relate to core groups, high availability groups, and the policy that defines the infrastructure.

Core groups
    Provide failover support. A default core group is created during startup and should be sufficient for most environments. Additional core groups can be created, but you should only create them if you fully understand the implications for your high availability environment. Core groups are static in nature. The configuration applied to a core group through user-defined policies determines the dynamic relationships within high availability groups.

High availability groups
    Closely bound to policy definitions. High availability groups are dynamic in nature and are not configured directly by users. Policy match criteria determine the high availability group to which a core group member belongs.

Network components
    Provide the underlying network infrastructure that is crucial to the success of the HAM. By default, WAS uses a channel framework protocol, but a unicast or multicast protocol can also be used. The network components include a technology that enables communication throughout the HAM infrastructure. All of these components must be active and properly configured to achieve a highly available infrastructure.
See Also
Core groups
High availability network components
Peer recovery of transactions
High availability groups