High availability

With high availability, WebSphere eXtreme Scale provides reliable data redundancy and detection of failures.

WebSphere eXtreme Scale self-organizes the JVMs of a data grid into a loosely federated tree, with the catalog service at the root and core groups that hold containers at the leaves of the tree.

The catalog service automatically organizes container servers into core groups of about 20 members. The core group members provide health monitoring for the other members of the group, and each core group elects one member as the leader, which communicates group information to the catalog service. Limiting the core group size keeps health monitoring effective while the environment remains highly scalable.

In a WebSphere Application Server environment, in which the core group size can be altered, eXtreme Scale does not support more than 50 members per core group.
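The grouping described above can be sketched as follows. This is an illustrative model only, not the product API: the class and method names are hypothetical, and the catalog service's real grouping logic also accounts for topology.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: split container JVMs into core groups of a bounded
// size (about 20 in eXtreme Scale). CoreGrouping and partition are
// hypothetical names for explanation only.
class CoreGrouping {
    static List<List<String>> partition(List<String> servers, int groupSize) {
        List<List<String>> groups = new ArrayList<>();
        for (int i = 0; i < servers.size(); i += groupSize) {
            // Each slice of up to groupSize servers becomes one core group.
            groups.add(servers.subList(i, Math.min(i + groupSize, servers.size())));
        }
        return groups;
    }
}
```

Bounding the group size is what keeps all-pairs health monitoring cheap: each member monitors only its roughly 20 peers, regardless of how large the overall grid grows.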

Heart beating

  1. Sockets are kept open between JVMs; if a socket closes unexpectedly, the closure is detected as a failure of the peer JVM. This detection catches failure cases such as a JVM exiting very quickly, and typically allows recovery from such failures in less than a second.

  2. Other types of failures, such as an operating system panic, physical server failure, or network failure, are discovered through heart beating. Heartbeats are sent periodically between pairs of processes; when a fixed number of consecutive heartbeats are missed, a failure is assumed. This approach detects failures in N*M seconds, where N is the number of missed heartbeats and M is the interval at which heartbeats are sent. Directly specifying M and N is not supported; instead, a slider mechanism lets you select from a range of tested M and N combinations.
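The N*M detection window can be sketched as follows. The class and method names are illustrative, not the eXtreme Scale API, and the specific values of M and N are hypothetical examples (20 missed beats at a 10-second interval yields the 200-second worst case cited for default hard-failure detection):

```java
// Sketch of heartbeat-based failure detection between peer JVMs.
// HeartbeatMonitor, tick, and detectionLatencySeconds are illustrative
// names; the real product exposes only a slider over tested M/N pairs.
public class HeartbeatMonitor {
    private final int intervalSeconds; // M: seconds between heartbeats
    private final int maxMissed;       // N: missed beats before failure is assumed
    private int missed = 0;

    public HeartbeatMonitor(int intervalSeconds, int maxMissed) {
        this.intervalSeconds = intervalSeconds;
        this.maxMissed = maxMissed;
    }

    // Called once per interval; received == true if a heartbeat arrived.
    // Returns true when the peer is assumed to have failed.
    public boolean tick(boolean received) {
        missed = received ? 0 : missed + 1;
        return missed >= maxMissed;
    }

    // Worst-case detection latency in seconds: N * M.
    public int detectionLatencySeconds() {
        return maxMissed * intervalSeconds;
    }
}
```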


There are several ways that a process can fail. The process could fail because a resource limit, such as the maximum heap size, was reached, or because process control logic terminated it. The operating system could fail, causing all of the processes running on the system to be lost. Hardware, such as a network interface card (NIC), can also fail, though less frequently, disconnecting the operating system from the network.

Many more points of failure can occur, causing the process to be unavailable. In this context, all of these failures can be categorized into one of two types:

eXtreme Scale container failure

Container failures are generally discovered by peer containers through the core group mechanism. When a container or set of containers fails, the catalog service migrates the shards that were hosted on those containers. The catalog service looks for a synchronous replica first, and falls back to an asynchronous replica only if no synchronous replica is available. After the primary shards are migrated to new host containers, the catalog service looks for new host containers for the replicas that are now missing.
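The promotion order described above can be sketched as follows. The enum and method names are hypothetical, for explanation only, and do not reflect the product API:

```java
import java.util.List;
import java.util.Optional;

// Illustrative replica kinds for a partition whose primary's container failed.
enum ShardType { SYNC_REPLICA, ASYNC_REPLICA }

class Promotion {
    // Sketch of the catalog service's preference: promote a synchronous
    // replica if one survived; otherwise fall back to an asynchronous one.
    static Optional<ShardType> pickNewPrimary(List<ShardType> survivors) {
        if (survivors.contains(ShardType.SYNC_REPLICA)) {
            return Optional.of(ShardType.SYNC_REPLICA);
        }
        if (survivors.contains(ShardType.ASYNC_REPLICA)) {
            return Optional.of(ShardType.ASYNC_REPLICA);
        }
        return Optional.empty(); // no replica survived for this partition
    }
}
```

A synchronous replica is preferred because it is guaranteed to hold every committed change; promoting an asynchronous replica risks losing the most recent transactions.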

Container islanding - The catalog service migrates shards off containers that it discovers to be unavailable. If those containers later become available again, the catalog service considers them eligible for placement, just as in the normal startup flow.

Container failure detection latency

Failures can be categorized as soft or hard. Soft failures typically occur when a process terminates. The operating system detects such failures and can reclaim the resources the process used, such as network sockets, very quickly; typical detection time for a soft failure is less than one second. Hard failures, such as physical machine crashes, network cable disconnections, or operating system failures, are not reported by the operating system, so eXtreme Scale must rely on heart beating, which can be configured, to detect them. With the default heartbeat tuning, a hard failure can take up to 200 seconds to detect. See Failure detection types for details on lowering the time it takes to detect a hard failure.

Catalog service failure

Because the catalog service grid is itself an eXtreme Scale grid, it uses the same core grouping mechanism as the container failure process. The primary difference is that the catalog service domain uses a peer election process to define the primary shard, rather than the catalog service algorithm that is used for the containers.

The placement service and the core grouping service are One of N services; a One of N service runs in exactly one member of the high availability group. The location service and administration run in all members of the high availability group. The placement service and core grouping service are singletons because they are responsible for laying out the system; the location service and administration are read-only services that exist everywhere to provide scalability.

The catalog service uses replication to make itself fault tolerant. If a catalog service process fails, restart the process to restore the system to the desired level of availability. If all of the processes that host the catalog service fail, eXtreme Scale loses critical data, and all of the containers must be restarted. Because the catalog service can run on many processes, this failure is unlikely. However, if you run all of the processes on a single box, within a single blade chassis, or behind a single network switch, a failure is more likely to occur. Remove common failure modes from the boxes that host the catalog service to reduce the possibility of failure.

Multiple container failures

A replica is never placed in the same process as its primary, because losing that process would mean losing both the primary and the replica. The deployment policy defines a development mode Boolean attribute that the catalog service uses to determine whether a replica can be placed on the same machine as a primary. In a development environment on a single machine, you might want to have two containers and replicate between them. In production, however, a single machine is insufficient, because the loss of that host results in the loss of both containers. To move from development mode on a single machine to production mode with multiple machines, disable development mode in the deployment policy configuration file.
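As a sketch, a production deployment policy might disable development mode on the map set as shown below. The grid, map set, and map names are hypothetical, and the attribute values are illustrative; verify element and attribute names against the deployment policy schema that ships with your version of the product.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<deploymentPolicy xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xmlns="http://ibm.com/ws/objectgrid/deploymentPolicy">
  <objectgridDeployment objectgridName="myGrid">
    <!-- developmentMode="false": a replica is never placed on the same
         machine as its primary (production setting) -->
    <mapSet name="myMapSet" numberOfPartitions="13"
            minSyncReplicas="1" maxSyncReplicas="1"
            developmentMode="false">
      <map ref="myMap"/>
    </mapSet>
  </objectgridDeployment>
</deploymentPolicy>
```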

Failure discovery and recovery summary

Loss type         Discovery (detection) mechanism   Recovery method
Process loss      I/O (socket closure)              Restart
Server loss       Heartbeat                         Restart
Network outage    Heartbeat                         Reestablish network and connection
Server-side hang  Heartbeat                         Stop and restart server
Server busy       Heartbeat                         Wait until server is available

See also

  1. Replication for availability
  2. High-availability catalog service
  3. Catalog server quorums

Parent topic:

Availability overview

Related concepts

Replicas and shards
Configure zones for replica placement
Multi-master data grid replication topologies
Distribute transactions
Map sets for replication