1.2 Availability definition
Availability is a measure of the proportion of time that a server is functioning normally, and it also reflects the time the recovery process requires after the system fails. In other words, it is the downtime that defines system availability. This downtime includes both planned and unplanned downtime.
Let A be an index of system availability expressed as a percentage, MTBF the mean time between failures, and MTTR the mean time to recover the system from failures. Thus, we have:
A = MTBF/(MTBF + MTTR)
As MTBF gets larger, A increases and MTTR has less impact on A. As MTTR approaches zero, A increases toward 100%. This means that if we can recover from failures very quickly, we have a highly available system. The time to recover a system includes fault detection time and system recovery time.
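As a minimal sketch of the formula above, the following Python helper computes A from MTBF and MTTR. The function names and the sample figures are illustrative assumptions, not values taken from any WebSphere tool or measurement.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return A = MTBF / (MTBF + MTTR) as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(a: float) -> float:
    """Expected downtime per year for an availability fraction a."""
    return (1.0 - a) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Hypothetical figures: the same MTBF with a slow and a fast recovery.
    for mttr in (10.0, 1.0, 0.0):
        a = availability(1000.0, mttr)
        print(f"MTTR={mttr:5.1f} h  A={a:.4%}  "
              f"downtime/year={annual_downtime_hours(a):.2f} h")
```

With the MTBF held fixed, shrinking the MTTR is what moves A toward 100%, which is the behavior the paragraph above describes.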
Clustering software uses fault detection mechanisms and automatically fails over the services to a healthy host to minimize the fault detection time and the service recovery time. MTTR is minimized because faults are detected quickly and the service does not have to wait for the failed node to be repaired. Therefore, A is significantly raised. Repairs to the failed node and upgrades of software and hardware can then be carried out without impacting service availability. This is the so-called hot replacement or rolling upgrade.
The availability issue is not as simple as the formula discussed above. First, MTBF is only a statistical average. For example, if a CPU has an MTBF of 500,000 hours, it does not mean that this CPU will fail after 57 years of use. In reality, this CPU can fail at any time. Second, there are many components in a system, and every component has a different MTBF and MTTR. These variations make system availability hard to predict using the formula above alone. We can build a simulation model for an end-to-end WebSphere system's availability with stochastic process theory such as Markov chains, but this topic is beyond the scope of this book.
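To illustrate why an MTBF is an average rather than a predicted lifetime, the sketch below assumes a constant failure rate, that is, exponentially distributed times to failure. That model is an assumption made here for illustration only; the text does not say which distribution applies to any real component, and the function name is hypothetical.

```python
import math


def prob_failure_by(t_hours: float, mtbf_hours: float) -> float:
    """Probability that a component has failed by time t.

    Assumes an exponential time-to-failure model with rate 1/MTBF,
    which is an illustrative assumption, not a property of MTBF itself.
    """
    if t_hours < 0 or mtbf_hours <= 0:
        raise ValueError("t must be non-negative and MTBF must be positive")
    return 1.0 - math.exp(-t_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 500_000.0  # hours, the CPU figure used in the text
    for years in (1, 5, 57):
        p = prob_failure_by(years * 8760, mtbf)
        print(f"P(failure within {years:2d} years) = {p:.2%}")
```

Under this model there is a small but nonzero chance of failure in the first year, and roughly 63% of such components would fail before reaching the MTBF itself, so the figure says nothing about when an individual CPU will fail.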
For a WebSphere production system, availability becomes much more complicated, because a WebSphere production system includes many components, such as firewalls, Load Balancers, Web servers, application servers and administrative servers (Node Agent and Deployment Manager), the administrative repository, log files, the persistent session database, application database or databases, and LDAP directory server and database. System availability is limited by the weakest point in the WebSphere production environment.
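One common way to reason about a chain of components is to multiply their individual availabilities. The sketch below does that under two assumptions that are not stated in the text: every component is required for the service to work, and components fail independently. The component names and availability figures are hypothetical.

```python
from math import prod


def series_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures; each value is a fraction in [0, 1].
    """
    values = list(component_availabilities)
    if any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("each availability must be between 0 and 1")
    return prod(values)


if __name__ == "__main__":
    # Hypothetical per-component availabilities for an end-to-end request path.
    components = {
        "firewall": 0.9999,
        "load balancer": 0.9995,
        "web server": 0.999,
        "application server": 0.998,
        "session database": 0.999,
        "LDAP server": 0.9995,
    }
    total = series_availability(components.values())
    weakest = min(components, key=components.get)
    print(f"end-to-end availability: {total:.4%}")
    print(f"weakest component: {weakest} ({components[weakest]:.4%})")
```

Because each factor is at most 1, the end-to-end result can never exceed the availability of the weakest component, and every additional single point of failure lowers it further.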
Usually, redundant hardware and clustering software are used to achieve high availability. Our goal is to minimize the MTTR through various HA techniques. That is, if MTTR=0, then A=100%, no matter what the MTBF is. Using this approach, system availability becomes predictable and manageable.
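The effect of redundancy can be sketched with the standard parallel-availability calculation: a redundant group is down only when all of its members are down at once. This assumes independent failures and instantaneous, always-successful failover, neither of which is guaranteed in practice; the function name and figures are hypothetical.

```python
from math import prod


def parallel_availability(member_availabilities):
    """Availability of a redundant group where any one member suffices.

    Assumes independent failures and perfect failover (both are
    simplifying assumptions for illustration).
    """
    values = list(member_availabilities)
    if not values or any(not 0.0 <= a <= 1.0 for a in values):
        raise ValueError("need at least one availability between 0 and 1")
    return 1.0 - prod(1.0 - a for a in values)


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one application server
    for n in (1, 2, 3):
        a = parallel_availability([single] * n)
        print(f"{n} redundant member(s): A = {a:.6%}")
```

In this idealized model each additional member multiplies the group's unavailability by that member's own unavailability, which is why clustering a modestly reliable server can yield a much more available service.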