Downtime is caused by either planned or unplanned events. Planned events can account for as much as 30% of downtime. As mentioned before, rolling upgrades and hot replacements can reduce planned downtime. However, the most important issue is how to minimize unplanned downtime, because no one can predict when it will occur, and all businesses require the system to be up during business hours.
Studies have shown that software failures and human error are responsible for a very high percentage of unplanned downtime. Software failures include network software failures, server software failures, and client software failures. Human errors can stem from missing skills, but also from the fact that system management is not easy to use.
Hardware failures and environmental problems also account for unplanned downtime, although far less than the other factors. Hardware failures can be reduced further by using functions such as state-of-the-art LPAR capabilities with self-optimizing resource adjustments, Capacity on Demand (to avoid overloading systems), and redundant hardware in the systems (to avoid single points of failure). You can find more information about LPAR for the IBM eServer iSeries and pSeries systems in the following resources:
You can find information about Capacity on Demand at:
http://www.ibm.com/servers/eServer/about/cod/
Many environmental problems are related to the data center. A standby system located at the same site might not suffice, because the entire site environment might be affected. Geographic clustering and data replication can minimize downtime caused by such environmental problems.
An end-to-end WebSphere high availability system that eliminates single points of failure in all parts of the system can minimize both planned and unplanned downtime. We describe the implementation of such a WebSphere high availability system throughout this book.
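To illustrate why eliminating single points of failure matters, the following minimal sketch compares a chain of single components with the same chain in which every tier is duplicated. It is not taken from any WebSphere tooling. The tier names and the 99% availability figures are hypothetical, and the calculation assumes that components fail independently and that failover is instantaneous, which real systems only approximate.

```python
from math import prod

HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def redundant(availability: float, copies: int) -> float:
    """Availability of `copies` identical components in parallel.

    The tier is down only when every copy is down at the same time
    (assumes independent failures).
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - availability) ** copies


def chain(tier_availabilities) -> float:
    """Availability of tiers in series: every tier must be up."""
    return prod(tier_availabilities)


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability figure."""
    return (1.0 - availability) * HOURS_PER_YEAR


# Hypothetical tiers and availability figures, for illustration only.
tiers = {"web server": 0.99, "application server": 0.99, "database": 0.99}

single = chain(tiers.values())
duplicated = chain(redundant(a, 2) for a in tiers.values())

print(f"single components: {single:.4%}, "
      f"{downtime_hours_per_year(single):.1f} h/year down")
print(f"duplicated tiers:  {duplicated:.4%}, "
      f"{downtime_hours_per_year(duplicated):.1f} h/year down")
```

Under these assumptions, the series chain is always less available than its weakest tier, while duplicating each tier reduces the expected downtime by roughly two orders of magnitude. Correlated failures, such as a site-wide environmental problem, violate the independence assumption, which is why geographic clustering is treated separately.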