IBM


1.2.1 Levels of availability

First of all, availability is closely related to cost, as shown in Figure 1-1. It is important to balance the downtime with cost. Normally, the more you invest, the less downtime there is. Therefore, it is also very important for you to evaluate what you will lose if your WebSphere service is temporarily unavailable. Different businesses have different costs of downtime, and some businesses, such as financial services, might lose millions of dollars for each hour of downtime during business hours. Costs for the downtime include not only direct money.

Figure 1-1 Levels of availability and costs

Redundant hardware and clustering software are approaches to high availability. We can divide availability at the following levels:

1. Basic systems.

Basic systems do not employ any special measures to protect data and services, although backups are taken regularly. When an outage occurs, the support personnel restores the system from the backup.

2. Redundant data.

Disk redundancy or disk mirroring are used to protect the data against the loss of a disk. Full disk mirroring provides more data protection than RAID-5.

3. Component failover.

For an infrastructure such as WebSphere, there are many components. An outage in any component can result in service interruption. Multiple threads or multiple instances can be employed for availability purposes. For example, if you do not make the firewall component highly available, it might cause the whole system to go down - worse than that, it might expose your system to hackers - even though the servers are highly available.

IBM WAS ND V6 provides process high availability (using vertically scaled appserver clusters) and process and node high availability (using horizontally scaled clusters). Highly available data management is critical for a highly available transactional system.

Below, the system availability seen by the client would be 85%.

4. System failover.

A standby or backup system is used to take over for the primary system if the primary system fails. In principle, any kind of service can become highly available by employing system failover techniques. However, this will not work if the software is hard-coded to physical host-dependent variables.

In system failover, clustering software monitors the health of the network, hardware, and software process, detects and communicates any fault, and automatically fails over the service and associated resources to a healthy host. Therefore, you can continue the service before you repair the failed system.

You can configure the systems as Active/Active mutual takeover or Active/Passive takeover. Although the Active/Active mutual takeover configuration increases the usage of hardware, it also increases the possibility of interruption, and hence reduces the availability. In addition, it is not efficient to include all components into a single cluster system. You can have a firewall cluster, an LDAP cluster, WebSphere server cluster, and database cluster.

You can use system failover for planned software and hardware maintenance and upgrades.

5. Disaster recovery.

This applies to maintaining systems in different sites. When the primary site becomes unavailable due to disasters, the backup site can become operational within a reasonable time. This can be done manually through regular data backups, or automatically by geographical clustering software.

Continuous availability means that high availability and continuous operations are required to eliminate all planned downtime.

Next