Availability definition

 

Before we describe different high availability (HA) implementations for WebSphere systems, we first need to define high availability and discuss how to measure it. Availability is a measure of the time that a server is functioning normally, and it also reflects the time the recovery process requires after the system fails. In other words, it is the downtime that defines system availability. This downtime includes both planned and unplanned downtime.

Let A be an index of system availability (a fraction between 0 and 1, usually expressed as a percentage), MTBF the mean time between failures, and MTTR the mean time to recover the system from failures; then we have:

A = MTBF/(MTBF + MTTR)

As MTBF gets larger, A increases and MTTR has less impact on A. As MTTR approaches zero, A increases toward 100 percent. This means that if we can recover from failures very quickly, we will have a highly available system. The time to recover a system includes fault detection time and system recovery time. Clustering software therefore uses fault detection mechanisms and automatically fails over the services to a healthy host, which minimizes both the fault detection time and the service recovery time. MTTR is minimized because faults are detected quickly and the service resumes on another host without waiting for the failed node to be repaired, so A rises significantly. Repairs to the failed node, and upgrades of software and hardware, can then be carried out without affecting service availability. This is the so-called hot replacement or rolling upgrade.
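The effect of MTTR on A can be checked with a few lines of Python. This is a minimal sketch of the formula above; the MTBF and MTTR figures are hypothetical values chosen only for illustration, not measurements from any WebSphere system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return A = MTBF / (MTBF + MTTR) as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures, for illustration only.
MTBF = 1000.0               # hours between failures
MANUAL_MTTR = 4.0           # hours to detect and repair a failure by hand
FAILOVER_MTTR = 1.0 / 60.0  # one minute to detect and fail over automatically

manual = availability(MTBF, MANUAL_MTTR)
failover = availability(MTBF, FAILOVER_MTTR)

print(f"manual repair:      {manual:.4%}")
print(f"automatic failover: {failover:.4%}")
```

With the same MTBF, shortening the recovery time from hours to about a minute moves A much closer to 100 percent, which is the argument for automatic failover made above.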

The availability issue is not as simple as the formula discussed above. First, MTBF is only a statistical average. For example, if a CPU has an MTBF of 500,000 hours, it does not mean that this CPU will fail after 57 years of use; this CPU can fail at any time. Second, there are many components in a system, and every component has a different MTBF and MTTR. These variations make it difficult to predict system availability from the formula alone. We can build a simulation model of an end-to-end WebSphere system's availability using stochastic process theory, such as Markov chains, but this is beyond the scope of this book.
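Both points can be illustrated with a rough sketch. It rests on two simplifying assumptions that the text does not make: component lifetimes follow an exponential distribution (constant failure rate), and components fail independently of each other. The per-component availability figures are hypothetical. A real system would need the kind of stochastic model mentioned above.

```python
import math

HOURS_PER_YEAR = 365 * 24


def failure_probability(mtbf_hours: float, elapsed_hours: float) -> float:
    """Probability of at least one failure within elapsed_hours.

    Assumes exponentially distributed lifetimes (an assumption of this sketch).
    """
    return 1.0 - math.exp(-elapsed_hours / mtbf_hours)


def series_availability(component_availabilities) -> float:
    """Availability of a chain in which every component must be up.

    Assumes independent failures (an assumption of this sketch).
    """
    return math.prod(component_availabilities)


# A CPU with an MTBF of 500,000 hours can still fail in its first year.
p_first_year = failure_probability(500_000, HOURS_PER_YEAR)
print(f"chance of CPU failure in year one: {p_first_year:.2%}")

# Five hypothetical components in a chain, each 99.9% available.
chain = series_availability([0.999] * 5)
print(f"availability of the whole chain:   {chain:.4%}")
```

Under these assumptions the chain is less available than any single component in it, which is one reason the end-to-end availability of a multi-component system is harder to reason about than the single formula suggests.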

For a WebSphere production system, availability becomes much more complicated, because a WebSphere system includes many components, such as firewalls, the Load Balancer (LB), Web servers, WebSphere Application Server (WAS) and administrative servers (Node Agent and Deployment Manager), the administrative repository, the JMS server, log files, the session persistence database, the application database, and the LDAP server and database.

Usually, redundant hardware and clustering software are used to achieve high availability. Our goal is to minimize the MTTR through various clustering techniques; in the limit where MTTR=0, A=100% no matter what the MTBF is. Using this approach, system availability becomes predictable and manageable.
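To make an availability target concrete, it helps to convert A into the downtime it allows per year. The following sketch does that conversion; the availability levels listed are common illustrative targets, not figures from this book, and the downtime counted includes both planned and unplanned outages.

```python
HOURS_PER_YEAR = 365 * 24


def annual_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per (non-leap) year implied by an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR * 60


# Illustrative availability targets.
for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = annual_downtime_minutes(target)
    print(f"A = {target:.3%} allows about {minutes:,.1f} minutes of downtime per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, so the recovery time per failure has to shrink accordingly.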


 

WebSphere is a trademark of the IBM Corporation in the United States, other countries, or both.

 

IBM is a trademark of the IBM Corporation in the United States, other countries, or both.