High availability manager

WebSphere Application Server uses a high availability manager to eliminate single points of failure. A high availability manager is responsible for running key services on available application servers rather than on a dedicated one (such as the deployment manager). It takes advantage of fault-tolerant storage technologies such as network attached storage (NAS) or a highly available file system (made highly available by replicating data to another server and using IP takeover), which significantly lowers the cost and complexity of high availability configurations. The high availability manager also provides peer-to-peer failover for critical services by always maintaining a backup for these services.

A high availability manager continually monitors the application server environment. If an application server component fails, the high availability manager takes over the in-flight and in-doubt work for the failed server. This action significantly improves application server availability.

In a highly available environment, all single points of failure are eliminated. Because the high availability manager function is dynamic, any configuration changes that you make and save while an application server is running are eventually picked up and used. You do not have to restart an application server to enable a change. For example, if you change a policy for a messaging engine high availability group while the messaging engine is running, the new policy is dynamically loaded and applied, and the behavior of the messaging engine reflects this change. A high availability manager focuses on recovery support and scalability in the following areas:

Messaging
Transaction managers
Workload Management (WLM) controllers
Application servers
WebSphere partitioning facility instances

To provide this focused failover service, the high availability manager supervises the Java Virtual Machines (JVMs) of the application servers that are core group members. The high availability manager uses one of the following methods to detect failures:

An application server is marked as failed if the socket fails. This method uses the KEEP_ALIVE function of TCP/IP, and is very tolerant of extreme application server loading, which might occur if the application server is swapping or thrashing heavily. This method is recommended for determining a JVM failure if you are using multicast emulation, and are running enough JVMs on a single application server to push the application server into extreme CPU starvation or memory starvation.
A JVM is marked as failed if it stops sending heartbeats for a specified time interval. This method is referred to as active failure detection. When it is used, a JVM sends out one heartbeat, or pulse, every second. If the JVM is unresponsive for more than 20 seconds, it is considered down. You can use this method with multicast emulation. However, this method must be used for true multicast addressing.

In either case, if a JVM fails, the application server on which it is running is separated from the core group and any services running on that application server are failed over to the surviving core group members.
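
The active failure detection method can be illustrated with a short sketch. This is not the product's implementation; the HeartbeatMonitor class and the server names are hypothetical, and only the one-second heartbeat interval and the 20-second limit come from the description above.

```python
from dataclasses import dataclass, field

HEARTBEAT_INTERVAL_S = 1.0   # from the text: one heartbeat every second
FAILURE_TIMEOUT_S = 20.0     # from the text: down if silent for more than 20 seconds


@dataclass
class HeartbeatMonitor:
    """Hypothetical tracker of the last heartbeat time seen from each JVM."""
    timeout_s: float = FAILURE_TIMEOUT_S
    last_seen: dict = field(default_factory=dict)

    def record_heartbeat(self, jvm_name: str, now: float) -> None:
        # Remember when this JVM was last heard from.
        self.last_seen[jvm_name] = now

    def failed_members(self, now: float) -> list:
        # A JVM counts as failed once its silence exceeds the timeout.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor()
monitor.record_heartbeat("server1", now=0.0)
monitor.record_heartbeat("server2", now=0.0)

# server1 keeps sending a heartbeat every second; server2 goes silent.
for tick in range(1, 22):
    monitor.record_heartbeat("server1", now=tick * HEARTBEAT_INTERVAL_S)

print(monitor.failed_members(now=20.0))  # server2 silent for exactly 20 s
print(monitor.failed_members(now=21.0))  # server2 silent for 21 s
```

In this sketch the comparison is strict, so a JVM that has been silent for exactly 20 seconds is not yet reported; whether the real boundary is inclusive is not stated in this topic.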

A JVM can be a node agent, an application server, or a deployment manager. If a JVM fails, any singletons running in that JVM are restarted on a peer JVM after the failure is detected. Because this peer JVM is already running, the normal startup time, which can potentially be minutes, is eliminated.
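
As a rough illustration of singleton failover, the following sketch reassigns every singleton that was running in a failed JVM to a peer that is already running. The function name, the singleton names, and the round-robin choice of peer are assumptions made for this example only; in the product, the policy determines where a singleton is activated.

```python
def fail_over_singletons(placement, failed_jvm, running_peers):
    """Return a new placement with the failed JVM's singletons moved to peers.

    placement maps a singleton name to the JVM that currently hosts it.
    Peers are chosen round-robin, which is an illustrative choice only.
    """
    if not running_peers:
        raise ValueError("no surviving peer JVM is available")
    new_placement = dict(placement)
    # Singletons that lost their host, in a stable order.
    orphaned = sorted(s for s, jvm in placement.items() if jvm == failed_jvm)
    for i, singleton in enumerate(orphaned):
        new_placement[singleton] = running_peers[i % len(running_peers)]
    return new_placement


# Hypothetical placement before server1 fails.
placement = {
    "transaction-manager-A": "server1",
    "messaging-engine-1": "server1",
    "transaction-manager-B": "server2",
}
after = fail_over_singletons(placement, "server1", ["server2", "server3"])
print(after)
```

Singletons hosted on the surviving JVMs keep their placement; only the work of the failed JVM moves.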

All of the application servers in a cell are defined as members of a core group. Each core group has only one logical high availability manager that services all of the members of that core group. The high availability manager is responsible for making the services within a core group highly available and scalable. It continually polls all of the core group members to verify that they are active and healthy.

A policy matching program is used to localize certain policy-driven components and to place these components into high availability groups. When a core group member fails, the high availability manager assigns the failing member's work to the same type of component from the same high availability group. Using NAS devices as common logging facilities helps to recover in-doubt and in-flight work if a component fails.
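
The idea of policy matching can be sketched as follows. The sketch assumes that a high availability group is described by a set of name and value properties, that each policy carries match criteria in the same form, and that the policy matching the most properties wins. The select_policy function, the policy names, and the property names are all hypothetical and are not product identifiers.

```python
def select_policy(group_name, policies):
    """Pick the policy whose match criteria all appear in group_name.

    When several policies match, the one with the most criteria wins
    (an assumed rule for this sketch). Returns None if nothing matches.
    """
    best_name, best_count = None, 0
    for policy_name, criteria in policies.items():
        matches = all(group_name.get(k) == v for k, v in criteria.items())
        if matches and len(criteria) > best_count:
            best_name, best_count = policy_name, len(criteria)
    return best_name


# Hypothetical policies, keyed by name, with their match criteria.
policies = {
    "messaging-default": {"type": "messaging"},
    "messaging-bus-a": {"type": "messaging", "bus": "busA"},
    "transactions-default": {"type": "transactions"},
}

print(select_policy({"type": "messaging", "bus": "busA", "engine": "me1"}, policies))
print(select_policy({"type": "messaging", "bus": "busB"}, policies))
print(select_policy({"type": "workload"}, policies))
```

The more specific policy is preferred for the first group, the general messaging policy applies to the second, and the third group matches no policy.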

WebSphere Application Server provides a default core group that is created during installation. New server instances are added to the default core group as they are created. The WebSphere Application Server environment can support multiple core groups, but one core group is usually sufficient for most environments.

A high availability manager comprises a variety of components. All of the components in a high availability manager infrastructure work together to ensure that peer-to-peer failover effectively protects the application server environment from failures. The following table describes the main high availability components, or areas of components, required for an effective high availability manager environment:

Server component areas	Focus on the application server run time and include such entities as cells and clusters. These areas are necessary for a healthy high availability manager run time because they closely relate to core groups, high availability groups, and the policy that defines the infrastructure.
Core groups	Provide failover support. A default core group is created during startup. This core group should be sufficient for most environments. Additional core groups can be created, but you should only create them if you fully understand the implications to your high availability environment. Core groups are static in nature. The configuration applied to a core group through user-defined policies determines the dynamic relationship within high availability groups.
High availability groups	Closely bound to policy definitions. High availability groups are dynamic in nature and are not configured directly by users. Policy match criteria determine the high availability group to which a core group member belongs.
Network components	Provide the underlying network infrastructure that is crucial to the success of the high availability manager. By default, WebSphere Application Server uses a channel framework protocol. However, a unicast or multicast protocol can also be used. The network components include a technology that enables communication throughout the high availability manager infrastructure.

All of these components must be active and properly configured to achieve a highly available infrastructure.
