Disasters
Hardware Failures
At its simplest, exposure due to hardware failures can be reduced by having spare hardware available. Of course, this approach assumes two things:
- Someone on-site has the necessary skills to diagnose the problem, identify the failing hardware, and replace it.
- A replacement for the failing hardware is available.
Before taking the approach of first fixing it yourself, make sure that the hardware in question:
- Is not still under warranty
- Is not under a service/maintenance contract of any kind
If you attempt repairs on hardware that is covered by a warranty and/or service contract, you are likely violating the terms of these agreements and jeopardizing your continued coverage.
When considering what hardware to stock, here are some of the issues you should keep in mind:
- Maximum allowable downtime
- The skill required to make the repair
- Budget available for spares
- Storage space required for spares
- Other hardware that could utilize the same spares
Each of these issues has a bearing on the types of spares that should be stocked. For example, stocking complete systems would tend to minimize downtime and require minimal skills to install but would be much more expensive than having a spare CPU and RAM module on a shelf. However, this expense might be worthwhile if your organization has several dozen identical servers that could benefit from a single spare system.
No matter what the final decision, the following question is inevitable and is discussed next.
How Much to Stock?
The question of spare stock levels is also multi-faceted. Here the main issues are:
- Maximum allowable downtime
- Projected rate of failure
- Estimated time to replenish stock
- Budget available for spares
- Storage space required for spares
- Other hardware that could utilize the same spares
At one extreme, for a system that can afford to be down a maximum of two days, and a spare that might be used once a year and could be replenished in a day, it would make sense to carry only one spare (and maybe even none, if you were confident of your ability to secure a spare within 24 hours).
At the other end of the spectrum, a system that can afford to be down no more than a few minutes, and a spare that might be used once a month (and could take several weeks to replenish) might mean that a half dozen spares (or more) should be on the shelf.
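The two extremes above can be roughed out numerically. The following is a minimal sketch, not a sizing method taken from this text: it assumes failures arrive independently at a constant average rate (a Poisson model), and the function name `spares_needed`, the `confidence` parameter, and the 21-day figure standing in for "several weeks" are all hypothetical.

```python
import math

def spares_needed(failures_per_year, replenish_days, confidence=0.99):
    """Smallest spare count that covers, with the given confidence, every
    failure expected while a replacement order is still outstanding.

    Assumes failures follow a Poisson process (an illustrative assumption).
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Average number of failures during one replenishment period
    expected = failures_per_year * replenish_days / 365.0
    spares = 0
    term = math.exp(-expected)   # probability of exactly 0 failures
    cumulative = term            # probability of at most `spares` failures
    while cumulative < confidence:
        spares += 1
        term *= expected / spares   # probability of exactly `spares` failures
        cumulative += term
    return spares

# A spare used once a year that can be replenished in a day
print(spares_needed(1, 1))
# A spare used once a month that takes three weeks to replenish
print(spares_needed(12, 21))
```

A tighter downtime limit corresponds to a higher `confidence` value, which pushes the count up. Budget, storage space, and sharing of spares across systems are not modeled here and still have to be weighed separately.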
Spares That Are Not Spares
When is a spare not a spare? When it is hardware that is in day-to-day use but is also available to serve as a spare for a higher-priority system should the need arise. This approach has some benefits:
- Less money dedicated to "non-productive" spares
- The hardware is known to be operative
There are, however, downsides to this approach:
- Normal production of the lower-priority task is interrupted
- There is an exposure should the lower-priority hardware fail (leaving no spare for the higher-priority hardware)
Given these constraints, the use of another production system as a spare may work, but the success of this approach hinges on the system's specific workload and the impact the system's absence has on overall data center operations.
Service Contracts
Service contracts make the issue of hardware failures someone else's problem. All that is necessary for you to do is to confirm that a failure has, in fact, occurred and that it does not appear to have a software-related cause. You then make a telephone call, and someone shows up to make things right again.
It seems so simple. But as with most things in life, there is more to it than meets the eye. Here are some things that you must consider when looking at a service contract:
- Hours of coverage
- Response time
- Parts availability
- Available budget
- Hardware to be covered
We explore each of these details more closely below.
Hours of Coverage
Different service contracts are available to meet different needs; one of the big variables between different contracts relates to the hours of coverage. Unless you are willing to pay a premium for the privilege, you cannot just call any time and expect to see a technician at your door a short time later.
Instead, depending on your contract, you might find that you cannot even phone the service company until a specific day/time, or if you can, they will not dispatch a technician until the day/time specified for your contract.
Most hours of coverage are defined in terms of the hours and the days during which a technician may be dispatched. Some of the more common hours of coverage are:
- Monday through Friday, 09:00 to 17:00
- Monday through Friday, 12/18/24 hours each day (with the start and stop times mutually agreed upon)
- Monday through Saturday (or Monday through Sunday), same times as above
As you might expect, the cost of a contract increases with the hours of coverage. In general, extending the coverage Monday through Friday tends to cost less than adding on Saturday and Sunday coverage.
But even here there is a possibility of reducing costs if you are willing to do some of the work.
Depot Service
If your situation does not require anything more than the availability of a technician during standard business hours and you have sufficient experience to be able to determine what is broken, you might consider looking at depot service. Depot service is known by many names (including walk-in service and drop-off service); under this arrangement, manufacturers may have service depots where technicians work on hardware brought in by customers.
Depot service has the benefit of being as fast as you are. You do not have to wait for a technician to become available and show up at your facility. Depot technicians do not go out on customer calls, meaning that there will be someone to work on your hardware as soon as you can get it to the depot.
Because depot service is done at a central location, there is a good chance that any required parts will be available. This can eliminate the need for an overnight shipment or waiting for a part to be driven several hundred miles from another office that just happened to have that part in stock.
There are some trade-offs, however. The most obvious is that you cannot choose the hours of service — you get service when the depot is open. Another aspect to this is that the technicians do not work past their quitting time, so if your system failed at 16:30 on a Friday and you got the system to the depot by 17:00, it will not be worked on until the technicians arrive at work the following Monday morning.
Another trade-off is that depot service depends on having a depot nearby. If your organization is located in a metropolitan area, this is likely not going to be a problem. However, organizations in more rural locations may find that a depot is a long drive away.
If considering depot service, take a moment and consider the mechanics of actually getting the hardware to the depot. Will you be using a company vehicle or your own? If your own, does your vehicle have the necessary space and load capacity? What about insurance? Will more than one person be necessary to load and unload the hardware?
Although these are rather mundane concerns, they should be addressed before making the decision to use depot service.
Response Time
In addition to the hours of coverage, many service agreements specify a level of response time. In other words, when you call requesting service, how long will it be before a technician arrives? As you might imagine, a faster response time equates to a more expensive service agreement.
There are limits to the response times that are available. For instance, the travel time from the manufacturer's office to your facility has a large bearing on the response times that are possible [1]. Response times in the four hour range are usually considered among the quicker offerings. Slower response times can range from eight hours (which effectively becomes "next day" service for a standard business hours agreement), to 24 hours. As with every other aspect of a service agreement, even these times are negotiable — for the right price.
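To see why an eight-hour response time turns into "next day" service, it helps to count only the hours that the agreement covers. The sketch below assumes that the response clock runs only during covered hours, which is an assumption; real contracts define this in their own terms. The function name `response_deadline` and its parameters are hypothetical.

```python
from datetime import datetime, timedelta

def response_deadline(call_time, response_hours, start_hour=9, end_hour=17,
                      covered_weekdays=(0, 1, 2, 3, 4)):
    """Latest arrival time if the response clock only runs during covered hours.

    Weekdays use Python's numbering (Monday is 0). end_hour must be below 24.
    """
    remaining = timedelta(hours=response_hours)
    current = call_time
    while True:
        window_start = current.replace(hour=start_hour, minute=0,
                                       second=0, microsecond=0)
        window_end = current.replace(hour=end_hour, minute=0,
                                     second=0, microsecond=0)
        if current.weekday() in covered_weekdays and current < window_end:
            # A call placed before opening starts counting at opening time
            current = max(current, window_start)
            available = window_end - current
            if remaining <= available:
                return current + remaining
            remaining -= available
        # Carry the unused response time over to the next day's window
        current = window_start + timedelta(days=1)

# Call placed Monday at 10:00 under an eight-hour, business-hours agreement
print(response_deadline(datetime(2024, 1, 1, 10, 0), 8))
# Call placed Friday at 16:30 under a four-hour agreement
print(response_deadline(datetime(2024, 1, 5, 16, 30), 4))
```

Under these assumptions the Monday call is due by Tuesday at 10:00, and the Friday afternoon call is not due until the following Monday.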
Although it is not a common occurrence, you should be aware that service agreements with response time clauses can sometimes stretch a manufacturer's service organization beyond its ability to respond. It is not unheard of for a very busy service organization to send somebody — anybody — on a short response-time service call just to meet their response time commitment. This person apparently diagnoses the problem, calling "the office" to have someone bring "the right part."
In fact, they are just waiting until someone who is actually capable of handling the call arrives.
While it might be understandable to see this happen under extraordinary circumstances (such as power problems that have damaged systems throughout their service area), if this is a consistent method of operation you should contact the service manager and demand an explanation.
If your response time needs are stringent (and your budget correspondingly large), there is one approach that can cut your response times even further — to zero.
Zero Response Time — Having an On-Site Technician
Given the appropriate situation (you are one of the biggest customers in the area), sufficient need (downtime of any magnitude is unacceptable), and financial resources (if you have to ask for the price, you probably cannot afford it), you might be a candidate for a full-time, on-site technician. The benefits of having a technician always standing by are obvious:
- Instant response to any problem
- A more proactive approach to system maintenance
As you might expect, this option can be very expensive, particularly if you require an on-site technician 24x7. But if this approach is appropriate for your organization, you should keep a number of points in mind in order to gain the most benefit.
First, on-site technicians need many of the resources of a regular employee, such as a workspace, telephone, appropriate access cards and/or keys, and so on.
On-site technicians are not very helpful if they do not have the proper parts. Therefore, make sure that secure storage is set aside for the technician's spare parts. In addition, make sure that the technician keeps a stock of parts appropriate for your configuration and that those parts are not routinely "cannibalized" by other technicians for their customers.
Parts Availability
Obviously, the availability of parts plays a large role in limiting your organization's exposure to hardware failures. In the context of a service agreement, the availability of parts takes on another dimension, as the availability of parts applies not only to your organization, but to any other customer in the manufacturer's territory that might need those parts as well. Another organization that has purchased more of the manufacturer's hardware than you might get preferential treatment when it comes to getting parts (and technicians, for that matter).
Unfortunately, there is little that can be done in such circumstances, short of working out the problem with the service manager.
Available Budget
As outlined above, service contracts vary in price according to the nature of the services being provided. Keep in mind that the costs associated with a service contract are a recurring expense; each time the contract is due to expire, you must negotiate a new contract and pay again.
Hardware to be Covered
Here is an area where you might be able to help keep costs to a minimum. Consider for a moment that you have negotiated a service agreement that has an on-site technician 24x7, on-site spares — you name it. Every single piece of hardware you have purchased from this vendor is covered, including the PC that the company receptionist uses to surf the Web while answering phones and handing out visitor badges.
Does that PC really need to have someone on-site 24x7? Even if the PC were vital to the receptionist's job, the receptionist only works from 09:00 to 17:00; it is highly unlikely that:
- The PC will be in use from 17:00 to 09:00 the next morning (not to mention weekends)
- A failure of this PC will be noticed, except between 09:00 and 17:00
Therefore, paying on the chance that this PC might need to be serviced in the middle of a Saturday night is a waste of money.
The thing to do is to split up the service agreement such that non-critical hardware is grouped separately from more critical hardware. In this way, costs can be kept as low as possible.
If you have twenty identically-configured servers that are critical to your organization, you might be tempted to have a high-level service agreement written for only one or two, with the rest covered by a much less expensive agreement. Then, the reasoning goes, no matter which one of the servers fails on a weekend, you will say that it is the one eligible for high-level service.
Do not do this. Not only is it dishonest, most manufacturers keep track of such things by using serial numbers. Even if you figure out a way around such checks, you will spend far more after being discovered than you will by being honest and paying for the service you really need.
Software Failures
Software failures can result in extended downtimes. For example, owners of a certain brand of computer systems noted for their high-availability features recently experienced this firsthand. A bug in the time handling code of the computer's operating system resulted in each customer's systems crashing at a certain time of a certain day. While this particular situation is a more spectacular example of a software failure in action, other software-related failures may be less dramatic, but still as devastating.
Software failures can strike in one of two areas:
- Operating system
- Applications
Each type of failure has its own specific impact and is explored in more detail in the following sections.
Operating System Failures
In this type of failure, the operating system is responsible for the disruption in service. Operating system failures come from two areas:
- Crashes
- Hangs
The main thing to keep in mind about operating system failures is that they take out everything that the computer was running at the time of the failure. As such, operating system failures can be devastating to production.
Crashes
Crashes occur when the operating system experiences an error condition from which it cannot recover. The reasons for crashes can range from an inability to handle an underlying hardware problem to a bug in the kernel-level code comprising the operating system. When an operating system crashes, the system must be rebooted in order to continue production.
Hangs
When the operating system stops handling system events, the system grinds to a halt. This is known as a hang. Hangs can be caused by deadlocks (two resource consumers contending for resources the other has) and livelocks (two or more processes responding to each other's activities, but doing no useful work), but the end result is the same — a complete lack of productivity.
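One way to picture the deadlock described above is as a cycle in a "wait-for" graph, where each resource consumer points at the consumers holding something it needs. The following is only an illustrative sketch of that idea, not a description of how any particular operating system detects deadlocks; the function name `has_deadlock` and the dictionary layout are hypothetical.

```python
def has_deadlock(waits_for):
    """Return True if the wait-for graph contains a cycle.

    waits_for maps each consumer to the consumers it is waiting on.
    """
    visiting = set()   # consumers on the current search path
    done = set()       # consumers already shown to be cycle-free

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True   # reached a consumer already on the path: a cycle
        visiting.add(node)
        for other in waits_for.get(node, ()):
            if visit(other):
                return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(visit(node) for node in list(waits_for))

# A waits on B while B waits on A: neither can make progress
print(has_deadlock({"A": ["B"], "B": ["A"]}))
# A waits on B, but B is not waiting on anything
print(has_deadlock({"A": ["B"], "B": []}))
```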
Application Failures
Unlike operating system failures, application failures can be more limited in the scope of their damage. Depending on the specific application, a single application failing might impact only one person. On the other hand, if it is a server application servicing a large population of client applications, the consequences of a failure would be much more widespread.
Application failures, like operating system failures, can be due to hangs and crashes; the only difference is that here it is the application that is hanging or crashing.
Getting Help — Software Support
Just as hardware vendors provide support for their products, many software vendors make support packages available to their customers. Except for the obvious differences (no spare hardware is required, and most of the work can be done by support personnel over the phone), software support contracts can be quite similar to hardware support contracts.
The level of support provided by a software vendor can vary. Here are some of the more common support strategies employed today:
- Documentation
- Self support
- Web or email support
- Telephone support
- On-site support
Each type of support is described in more detail below.
Documentation
Although often overlooked, software documentation can serve as a first-level support tool. Whether online or printed, documentation often contains the information necessary to resolve many issues.
Self Support
Self support relies on the customer using online resources to resolve their own software-related issues. Quite often these resources take the form of Web-based FAQs (Frequently Asked Questions) or knowledge bases.
FAQs often have little or no selection capabilities, leaving the customer to scroll through question after question in the hopes of finding one that addresses the issue at hand. Knowledge bases tend to be somewhat more sophisticated, allowing the entry of search terms. Knowledge bases can also be quite extensive in scope, making them a good tool for resolving problems.
Web or Email Support
Many times what looks like a self support website also includes Web-based forms or email addresses that make it possible to send questions to support staff. While this might at first glance appear to be an improvement over a good self support website, it really depends on the people answering the email.
If the support staff is overworked, it is difficult to get the necessary information from them, as their main concern is to quickly respond to each email and move on to the next one. The reason for this is that nearly all support personnel are evaluated by the number of issues that they resolve. Escalation of issues is also difficult because there is little that can be done within an email to encourage more timely and helpful responses, particularly when the person reading your email is in a hurry to move on to the next one.
The way to get the best service is to make sure that your email anticipates all the questions that a support technician might ask. In particular:
- Clearly describe the nature of the problem
- Include all pertinent version numbers
- Describe what you have already done in an attempt to address the problem (applied the latest patches, rebooted with a minimal configuration, etc.)
By giving the support technician more information, you stand a better chance of getting the support you need.
Telephone Support
As the name implies, telephone support entails speaking to a support technician via telephone. This style of support is most similar to hardware support, in that there can be various levels of support available (with different hours of coverage, response time, etc.).
On-Site Support
Also known as on-site consulting, on-site software support is normally reserved for resolving specific issues or making critical changes, such as initial software installation and configuration, major upgrades, and so on. As expected, this is the most expensive type of software support available.
Still, there are instances where on-site support makes sense. As an example, consider a small organization with a single system administrator. The organization is going to be deploying its first database server, but the deployment (and the organization) is not large enough to justify hiring a dedicated database administrator. In this situation, it can often be cheaper to bring in a specialist from the database vendor to handle the initial deployment (and occasionally later on, as the need arises) than it would be to train the system administrator in a skill that will be seldom used.
Environmental Failures
Even though the hardware may be running perfectly, and even though the software may be configured properly and is working as it should, problems can still occur. The most common problems that occur outside of the system itself have to do with the physical environment in which the system resides.
Environmental issues can be broken into four major categories:
- Building integrity
- Electricity
- Air conditioning
- Weather and the outside world
Building Integrity
For such a seemingly simple structure, a building performs a great many functions. It provides shelter from the elements. It provides the proper micro-climate for the building's contents. It has mechanisms to provide power and to protect against fire, theft, and vandalism. Performing all these functions, it is not surprising that there is a great deal that can go wrong with a building. Here are some possibilities to consider:
- Roofs can leak, allowing water into data centers.
- Various building systems (such as water, sewer, or air handling) can fail, rendering the building uninhabitable.
- Floors may have insufficient load-bearing capacity to hold the equipment you want to put in the data center.
It is important to have a creative mind when it comes to thinking about the different ways buildings can fail. The list above is only meant to start you thinking along the proper lines.
Electricity
Because electricity is the lifeblood of any computer system, power-related issues are paramount in the mind of system administrators everywhere. There are several different aspects to power; they are covered in more detail in the following sections.
The Security of Your Power
First, it is necessary to determine how secure your normal power supply may be. Just like nearly every other data center, you probably obtain your power from a local power company via power transmission lines. Because of this, there are limits to what you can do to make sure that your primary power supply is as secure as possible.
Organizations located near the boundaries of a power company might be able to negotiate connections to two different power grids:
- The one servicing your area
- The one from the neighboring power company
The costs involved in running power lines from the neighboring grid are sizable, making this an option only for larger organizations. However, such organizations find that the redundancy gained outweighs the costs in many cases.
The main things to check are the methods by which the power is brought onto your organization's property and into the building. Are the transmission lines above ground or below? Above-ground lines are susceptible to:
- Damage from extreme weather conditions (ice, wind, lightning)
- Traffic accidents that damage the poles and/or transformers
- Animals straying into the wrong place and shorting out the lines
However, below-ground lines have their own unique shortcomings:
- Damage from construction workers digging in the wrong place
- Flooding
- Lightning (though much less so than above-ground lines)
Continue to trace the power lines into your building. Do they first go to an outside transformer? Is that transformer protected from vehicles backing into it or trees falling on it? Are all exposed shutoff switches protected against unauthorized use?
Once inside your building, could the power lines (or the panels to which they attach) be subject to other problems? For instance, could a plumbing problem flood the electrical room?
Continue tracing the power into the data center; is there anything else that could unexpectedly interrupt your power supply? For example, is the data center sharing one or more circuits with non-data center loads? If so, the external load might one day trip the circuit's overload protection, taking down the data center as well.
Power Quality
It is not enough to ensure that the data center's power source is as secure as possible. You must also be concerned with the quality of the power being distributed throughout the data center. There are several factors that must be considered:
- Voltage
The voltage of the incoming power must be stable, with no voltage reductions (often called sags, droops, or brownouts) or voltage increases (often known as spikes and surges).
- Waveform
The waveform must be a clean sine wave, with minimal THD (Total Harmonic Distortion).
- Frequency
The frequency must be stable (most countries use a power frequency of either 50Hz or 60Hz).
- Noise
The power must not include any RFI (Radio Frequency Interference) or EMI (Electro-Magnetic Interference) noise.
- Current
The power must be supplied at a current rating sufficient to run the data center.
Power supplied directly from the power company does not normally meet the standards necessary for a data center. Therefore, some level of power conditioning is usually required. There are several different approaches possible:
- Surge Protectors
Surge protectors do just what their name implies — they filter surges from the power supply. Most do nothing else, leaving equipment vulnerable to damage from other power-related problems.
- Power Conditioners
Power conditioners attempt a more comprehensive approach; depending on the sophistication of the unit, power conditioners often can take care of most of the types of problems outlined above.
- Motor-Generator Sets
A motor-generator set is essentially a large electric motor powered by your normal power supply. The motor is attached to a large flywheel, which is, in turn, attached to a generator. The motor turns the flywheel and generator, which generates electricity in sufficient quantities to run the data center. In this way, the data center power is electrically isolated from outside power, meaning that most power-related problems are eliminated. The flywheel also provides the ability to maintain power through short outages, as it takes several seconds for the flywheel to slow to the point at which it can no longer generate power.
- Uninterruptible Power Supplies
Some types of Uninterruptible Power Supplies (more commonly known as UPSs) include most (if not all) of the protection features of a power conditioner [2].
With the last two technologies listed above, we have started in on the topic most people think of when they think about power — backup power. In the next section, different approaches to providing backup power are explored.
Backup Power
One power-related term that nearly everyone has heard is the term blackout. A blackout is a complete loss of electrical power and may last from a fraction of a second to weeks.
Because the length of blackouts can vary so greatly, it is necessary to approach the task of providing backup power using different technologies for power outages of different lengths.
The most frequent blackouts last, on average, no more than a few seconds; longer outages are much less frequent. Therefore, concentrate first on protecting against blackouts of only a few minutes in duration, then work out methods of reducing your exposure to longer outages.
Providing Power For the Next Few Seconds
Since the majority of outages last only a few seconds, your backup power solution must have two primary characteristics:
- Very short time to switch to backup power (known as transfer time)
- A runtime (the time that backup power will last) measured in seconds to minutes
The backup power solutions that match these characteristics are motor-generator sets and UPSs. The flywheel in the motor-generator set allows the generator to continue producing electricity for enough time to ride out outages of a second or so. Motor-generator sets tend to be quite large and expensive, making them a practical solution only for mid-sized and larger data centers.
However, another technology — called a UPS — can fill in for those situations where a motor-generator set is too expensive. It can also handle longer outages.
Providing Power For the Next Few Minutes
UPSs can be purchased in a variety of sizes — small enough to run a single low-end PC for five minutes or large enough to power an entire data center for an hour or more.
UPSs are made up of the following parts:
- A transfer switch for switching from the primary power supply to the backup power supply
- A battery, for providing backup power
- An inverter, which converts the DC power from the battery into the AC power required by the data center hardware
Apart from the size and battery capacity of the unit, UPSs come in two basic types:
- The offline UPS uses its inverter to generate power only when the primary power supply fails.
- The online UPS uses its inverter to generate power all the time, powering the inverter via its battery only when the primary power supply fails.
Each type has its advantages and disadvantages. The offline UPS is usually less expensive, because the inverter does not have to be constructed for full-time operation. However, a problem in the inverter of an offline UPS will go unnoticed (until the next power outage, that is).
Online UPSs tend to be better at providing clean power to your data center; after all, an online UPS is essentially generating power for you full time.
But no matter what type of UPS you choose, properly size the UPS to your anticipated load (thereby ensuring that the UPS has sufficient capacity to produce electricity at the required voltage and current), and determine how long you would like to be able to run your data center on battery power.
To determine this information, first identify those loads that are to be serviced by the UPS. Go to each piece of equipment and determine how much power it draws (this is normally listed on a label near the unit's power cord). Write down the voltage, watts, and/or amps. Once you have these figures for all of the hardware, convert them to VA (Volt-Amps). If you have a wattage number, you can use the listed wattage as the VA; if you have amps, multiply it by volts to get VA. By adding the VA figures you can arrive at the approximate VA rating required for the UPS.
Strictly speaking, this approach to calculating VA is not entirely correct; however, to get the true VA you would need to know the power factor for each unit, and this information is rarely, if ever, provided. In any case, the VA numbers obtained from this approach reflect worst-case values, leaving a large margin of error for safety.
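The sizing procedure above can be sketched in a few lines. This is a rough illustration under the same simplification the text makes (listed wattage is used directly as VA, which amounts to treating the power factor as 1). The names `estimate_va` and `total_va` and the sample loads are hypothetical, and preferring volts times amps when both kinds of figures are listed is a choice made for this sketch.

```python
def estimate_va(volts=None, amps=None, watts=None):
    """Approximate VA for one piece of equipment from its label figures."""
    if volts is not None and amps is not None:
        # Amps multiplied by volts gives VA
        return volts * amps
    if watts is not None:
        # Listed wattage used directly as VA (power factor treated as 1)
        return watts
    raise ValueError("need either volts and amps, or watts")

def total_va(loads):
    """Sum the per-unit estimates to get the approximate UPS VA rating."""
    return sum(estimate_va(**load) for load in loads)

# Hypothetical label readings for three pieces of equipment
loads = [
    {"volts": 120, "amps": 4.0},
    {"watts": 150},
    {"volts": 120, "amps": 1.5},
]
print(total_va(loads))
```

The total is the approximate minimum VA rating to look for. It says nothing about runtime, which has to be decided separately.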
Determining runtime is more of a business question than a technical question — what sorts of outages are you willing to protect against, and how much money are you prepared to spend to do so? Most sites select runtimes that are less than an hour or two at most, as battery-backed power becomes very expensive beyond this point.
Providing Power For the Next Few Hours (and Beyond)
Once we get into power outages that are measured in days, the choices get even more expensive. The technologies capable of handling long-term power outages are limited to generators powered by some type of engine — diesel and gas turbine, primarily.
Keep in mind that engine-powered generators require regular refueling while they are running. You should know your generator's fuel "burn" rate at maximum load and arrange fuel deliveries accordingly.
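The refueling arithmetic is simple enough to sketch. The figures and names below (`hours_of_fuel`, `latest_order_time`, a 2000-unit tank, a burn rate of 50 units per hour, a 12-hour delivery lead time) are hypothetical; substitute the burn rate at maximum load from your generator's documentation and the lead time your fuel supplier actually quotes.

```python
def hours_of_fuel(tank_capacity, burn_rate_per_hour, reserve_fraction=0.0):
    """Hours the generator can run at maximum load before reaching the reserve.

    tank_capacity and burn_rate_per_hour must use the same fuel unit.
    """
    if burn_rate_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    usable = tank_capacity * (1.0 - reserve_fraction)
    return usable / burn_rate_per_hour

def latest_order_time(tank_capacity, burn_rate_per_hour, delivery_lead_hours,
                      reserve_fraction=0.0):
    """Hours after startup by which a fuel delivery must be ordered."""
    runtime = hours_of_fuel(tank_capacity, burn_rate_per_hour, reserve_fraction)
    return runtime - delivery_lead_hours

# Hypothetical generator: 2000-unit tank, 50 units per hour at maximum load
print(hours_of_fuel(2000, 50))
# With a 12-hour delivery lead time
print(latest_order_time(2000, 50, 12))
```

A negative result from `latest_order_time` means the tank cannot last through one delivery cycle, so fuel would have to be on order before the generator is ever started.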
At this point, your options are wide open, assuming your organization has sufficient funds. This is also an area where experts should help you determine the best solution for your organization. Very few system administrators have the specialized knowledge necessary to plan the acquisition and deployment of these kinds of power generation systems.
Portable generators of all sizes can be rented, making it possible to have the benefits of generator power without the initial outlay of money necessary to purchase one. However, keep in mind that in disasters affecting your general vicinity, rented generators will be in very short supply and very expensive.
Planning for Extended Outages
While a blackout of five minutes is little more than an inconvenience to the personnel in a darkened office, what about an outage that lasts an hour? Five hours? A day? A week?
The fact is, even if the data center is operating normally, an extended outage will eventually affect your organization at some point. Consider the following points:
- What if there is no power to maintain environmental control in the data center?
- What if there is no power to maintain environmental control in the entire building?
- What if there is no power to operate personal workstations, the telephone system, the lights?
The point here is that your organization must determine at what point an extended outage will just have to be tolerated. Or if that is not an option, your organization must reconsider its ability to function completely independently of on-site power for extended periods, meaning that very large generators will be needed to power the entire building.
Of course, even this level of planning cannot take place in a vacuum. It is very likely that whatever caused the extended outage is also affecting the world outside your organization, and that the outside world will start having an effect on your organization's ability to continue operations, even given unlimited power generation capacity.
Heating, Ventilation, and Air Conditioning
The Heating, Ventilation, and Air Conditioning (HVAC) systems used in today's office buildings are incredibly sophisticated. Often computer controlled, the HVAC system is vital to providing a comfortable work environment.
Data centers usually have additional air handling equipment, primarily to remove the heat generated by the many computers and associated equipment. Failures in an HVAC system can be devastating to the continued operation of a data center. And given their complexity and electro-mechanical nature, the possibilities for failure are many and varied. Here are a few examples:
- The air handling units (essentially large fans driven by large electric motors) can fail due to electrical overload, bearing failure, belt/pulley failure, etc.
- The cooling units (often called chillers) can lose their refrigerant due to leaks, or they can have their compressors and/or motors seize.
HVAC repair and maintenance is a very specialized field — a field that the average system administrator should leave to the experts. If anything, a system administrator should make sure that the HVAC equipment serving the data center is checked for normal operation on a daily basis (if not more frequently) and is maintained according to the manufacturer's guidelines.
Weather and the Outside World
There are some types of weather that can cause problems for a system administrator:
- Heavy snow and ice can prevent personnel from getting to the data center, and can even clog air conditioning condensers, resulting in elevated data center temperatures just when no one is able to get to the data center to take corrective action.
- High winds can disrupt power and communications, with extremely high winds actually doing damage to the building itself.
There are other types of weather that can still cause problems, even if they are not as well known. For example, exceedingly high temperatures can result in overburdened cooling systems, and brownouts or blackouts as the local power grid becomes overloaded.
Although there is little that can be done about the weather, knowing the way that it can affect your data center operations can help you to keep things running even when the weather turns bad.
End-User Errors
The users of a computer can make mistakes that can have serious impact. However, due to their normally unprivileged operating environment, user errors tend to be localized in nature. Because most users interact with a computer exclusively through one or more applications, it is within applications that most end-user errors occur.
Improper Use of Applications
When applications are used improperly, various problems can occur:
- Files inadvertently overwritten
- Wrong data used as input to an application
- Files not clearly named and organized
- Files accidentally deleted
The list could go on, but this is enough to illustrate the point. Due to users not having super-user privileges, the mistakes they make are usually limited to their own files. As such, the best approach is two-pronged:
- Educate users in the proper use of their applications and in proper file management techniques
- Make sure backups of users' files are made regularly and that the restoration process is as streamlined and quick as possible
Beyond this, there is little that can be done to keep user errors to a minimum.
Operations Personnel Errors
Operators have a more in-depth relationship with an organization's computers than end-users. Where end-user errors tend to be application-oriented, operators tend to perform a wider range of tasks. Although the nature of the tasks has been dictated by others, some of these tasks can include the use of system-level utilities, where the potential for widespread damage due to errors is greater. Therefore, the types of errors that an operator might make center on the operator's ability to follow the procedures that have been developed for the operator's use.
Failure to Follow Procedures
Operators should have sets of procedures documented and available for nearly every action they perform [3]. It might be that an operator does not follow the procedures as they are laid out. There can be several reasons for this:
- The environment was changed at some time in the past, and the procedures were never updated. Now the environment changes again, rendering the operator's memorized procedure invalid. At this point, even if the procedures were updated (which is unlikely, given the fact that they were not updated before) the operator will not be aware of it.
- The environment was changed, and no procedures exist. This is just a more out-of-control version of the previous situation.
- The procedures exist and are correct, but the operator will not (or cannot) follow them.
Depending on the management structure of your organization, you might not be able to do much more than communicate your concerns to the appropriate manager. In any case, making yourself available to do what you can to help resolve the problem is the best approach.
Mistakes Made During Procedures
Even if the operator follows the procedures, and even if the procedures are correct, it is still possible for mistakes to be made. If this happens, the possibility exists that the operator is careless (in which case the operator's management should become involved).
Another explanation is that it was just a mistake. In these cases, the best operators realize that something is wrong and seek assistance. Always encourage the operators you work with to contact the appropriate people immediately if they suspect something is wrong. Although many operators are highly-skilled and able to resolve many problems independently, the fact of the matter is that this is not their job. And a problem that is made worse by a well-meaning operator harms both that person's career and your ability to quickly resolve what might originally have been a small problem.
System Administrator Errors
Unlike operators, system administrators perform a wide variety of tasks using an organization's computers. Also unlike operators, the tasks that system administrators perform are often not based on documented procedures.
Therefore, system administrators sometimes make unnecessary work for themselves when they are not careful about what they are doing. During the course of carrying out day-to-day responsibilities, system administrators have more than sufficient access to the computer systems (not to mention their super-user access privileges) to mistakenly bring systems down.
System administrators either make errors of misconfiguration or errors during maintenance.
Misconfiguration Errors
System administrators must often configure various aspects of a computer system. This configuration might include:
- User accounts
- Network
- Applications
The list could go on quite a bit longer. The actual task of configuration varies greatly; some tasks require editing a text file (using any one of a hundred different configuration file syntaxes), while other tasks require running a configuration utility.
The fact that these tasks are all handled differently is merely an additional challenge to the basic fact that each configuration task itself requires different knowledge. For example, the knowledge required to configure a mail transport agent is fundamentally different from the knowledge required to configure a new network connection.
Given all this, perhaps it should be surprising that so few mistakes are actually made. In any case, configuration is, and will continue to be, a challenge for system administrators. Is there anything that can be done to make the process less error-prone?
Change Control
The common thread of every configuration change is that some sort of a change is being made. The change may be large, or it may be small. But it is still a change and should be treated in a particular way.
Many organizations implement some type of change control process. The intent is to help system administrators (and all parties affected by the change) to manage the process of change and to reduce the organization's exposure to any errors that may occur.
A change control process normally breaks the change into different steps. Here is an example:
- Preliminary research
Preliminary research attempts to clearly define:
- The nature of the change to take place
- Its impact, should the change succeed
- A fallback position, should the change fail
- An assessment of what types of failures are possible
Preliminary research might include testing the proposed change during a scheduled downtime, or it may go so far as to include implementing the change first on a special test environment run on dedicated test hardware.
- Scheduling
The change is examined with an eye toward the actual mechanics of implementation. The scheduling being done includes outlining the sequencing and timing of the change (along with the sequencing and timing of any steps necessary to back the change out should a problem arise), as well as ensuring that the time allotted for the change is sufficient and does not conflict with any other system-level activity.
The product of this process is often a checklist of steps for the system administrator to use while making the change. Included with each step are instructions to perform in order to back out the change should the step fail. Estimated times are often included, making it easier for the system administrator to determine whether the work is on schedule or not.
- Execution
At this point, the actual execution of the steps necessary to implement the change should be straightforward and anti-climactic. The change is either implemented, or (if trouble crops up) it is backed out.
- Monitoring
Whether the change is implemented or not, the environment is monitored to make sure that everything is operating as it should.
- Documenting
If the change has been implemented, all existing documentation is updated to reflect the changed configuration.
Obviously, not all configuration changes require this level of detail. Creating a new user account should not require any preliminary research, and scheduling would likely consist of determining whether the system administrator has a spare moment to create the account. Execution would be similarly quick; monitoring might consist of ensuring that the account was usable, and documenting would probably entail sending an email to the new user's manager.
But as configuration changes become more complex, a more formal change control process becomes necessary.
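The checklist produced by the scheduling step, where each step carries its own back-out instructions, can be modeled roughly as follows. This is a minimal sketch, not a real change control tool: the Step class, the execute_change function, and the sample step names are all hypothetical. It also assumes that a failed step may have been partially applied, so the failed step is backed out along with the completed ones.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    apply: Callable[[], bool]     # returns True if the step succeeded
    back_out: Callable[[], None]  # undoes whatever the step did
    estimated_minutes: int = 0    # lets the administrator judge progress


def execute_change(steps: List[Step]) -> bool:
    """Run the steps in order. If one fails, back out that step and every
    earlier step in reverse order, then report that the change was not made."""
    attempted: List[Step] = []
    for step in steps:
        attempted.append(step)
        if not step.apply():
            for done in reversed(attempted):
                done.back_out()
            return False
    return True


# Demonstration with stand-in steps that only record what happened.
log: List[str] = []


def make_step(name: str, succeeds: bool) -> Step:
    def apply() -> bool:
        log.append(f"apply {name}")
        return succeeds

    def back_out() -> None:
        log.append(f"back out {name}")

    return Step(name, apply, back_out)


steps = [
    make_step("stop service", True),
    make_step("edit configuration", True),
    make_step("restart service", False),  # simulated failure
]
implemented = execute_change(steps)
print("implemented" if implemented else "backed out")
for entry in log:
    print(entry)
```

Whichever way the run ends, the monitoring and documenting steps still apply: the environment is checked afterward, and the documentation is updated only if the change was implemented.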
Mistakes Made During Maintenance
This type of error can be insidious because there is usually so little planning and tracking done during day-to-day maintenance.
System administrators see the results of this kind of error every day, especially from the many users that swear they did not change a thing — the computer just broke. The user that says this usually does not remember what they did, and when the same thing happens to you, you will probably not remember what you did, either.
The key thing to keep in mind is that you must be able to remember what changes you made during maintenance if you are to resolve any problems quickly. A full-blown change control process is not realistic for the hundreds of small things done over the course of a day. What can be done to keep track of the 101 small things a system administrator does every day?
The answer is simple: take notes. Whether it is done in a paper notebook, a PDA, or as comments in the affected files, take notes. By tracking what you have done, you stand a better chance of seeing a failure as being related to a change you recently made.
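If you prefer to keep your notes in a file, a small helper that adds a timestamp to each entry is enough. The following is a minimal sketch: the note function and the admin-notes.log file name are made up for this example, and the demonstration writes to a temporary directory so that it does not touch any real files.

```python
import tempfile
from datetime import datetime
from pathlib import Path


def note(message: str, logfile: Path) -> str:
    """Append a timestamped one-line note and return the line written."""
    stamp = datetime.now().isoformat(timespec="seconds")
    line = f"{stamp} {message}"
    with logfile.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    notes = Path(tmp) / "admin-notes.log"
    note("raised message size limit in mail server configuration", notes)
    note("restarted mail server after configuration edit", notes)
    entries = notes.read_text(encoding="utf-8").splitlines()

for entry in entries:
    print(entry)
```

When something breaks, the most recent entries are the first place to look for a change that might be related.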
Service Technician Errors
Sometimes the very people that are supposed to help you keep your systems running reliably can actually make things worse. This is not due to any conspiracy; it is just that anyone working on any technology for any reason risks rendering that technology inoperable. The same effect is at work when programmers fix one bug but end up creating another.
Improperly-Repaired Hardware
In this case, the technician either failed to correctly diagnose the problem and made an unnecessary (and useless) repair, or the diagnosis was correct, but the repair was not carried out properly. It may be that the replacement part was itself defective, or that the proper procedure was not followed when the repair was carried out.
This is why it is important to be aware of what the technician is doing at all times. By doing this, you can keep an eye out for failures that seem to be related to the original problem in some way. This keeps the technician on track should there be a problem; otherwise there is a chance that the technician will view this fault as being new and unrelated to the one that was supposedly fixed. In this way, time is not wasted chasing the wrong problem.
Fixing One Thing and Breaking Another
Sometimes, even though a problem was diagnosed and repaired successfully, another problem pops up to take its place. The CPU module was replaced, but the anti-static bag it came in was left in the cabinet, blocking the fan and causing an over-temperature shutdown. Or the failing disk drive in the RAID array was replaced, but because a connector on another drive was bumped and accidentally disconnected, the array is still down.
These things might be the result of chronic carelessness or an honest mistake. It does not matter. You should always carefully review the repairs made by the technician and ensure that the system is working properly before letting the technician leave.
And this would likely be considered a best-case response time, as technicians usually are responsible for territories that extend away from their office in all directions. If you are at one end of their territory and the only available technician is at the other end, the response time will be even longer.
If the operators at your organization do not have a set of operating procedures, work with them, your management, and your users to get them created. Without them, a data center is out of control and likely to experience severe problems in the course of day-to-day operations.
Backups
Backups have two major purposes:
- To permit restoration of individual files
- To permit wholesale restoration of entire file systems
The first purpose is the basis for the typical file restoration request: a user accidentally deletes a file and asks that it be restored from the latest backup. The exact circumstances may vary somewhat, but this is the most common day-to-day use for backups.
The second situation is a system administrator's worst nightmare: for whatever reason, the system administrator is staring at hardware that used to be a productive part of the data center. Now, it is little more than a lifeless chunk of steel and silicon. The thing that is missing is all the software and data you and your users have assembled over the years. Supposedly everything has been backed up. The question is: has it?
And if it has, will you be able to restore it?
Different Data: Different Backup Needs
If you look at the kinds of data[1] processed and stored by a typical computer system, you will find that some of the data hardly ever changes, and some of the data is constantly changing.
The pace at which data changes is crucial to the design of a backup procedure. There are two reasons for this:
- A backup is nothing more than a snapshot of the data being backed up. It is a reflection of that data at a particular moment in time.
- Data that changes infrequently can be backed up infrequently, while data that changes often must be backed up more frequently.
System administrators that have a good understanding of their systems, users, and applications should be able to quickly group the data on their systems into different categories. However, here are some examples to get you started:
- Operating System
This data normally only changes during upgrades, the installation of bug fixes, and any site-specific modifications.
Should you even bother with operating system backups? This is a question that many system administrators have pondered over the years. On the one hand, if the installation process is relatively easy, and if the application of bug fixes and customizations is well documented and easily reproducible, reinstalling the operating system may be a viable option.
On the other hand, if there is the least doubt that a fresh installation can completely recreate the original system environment, backing up the operating system is the best choice, even if the backups are performed much less frequently than the backups for production data. Occasional operating system backups also come in handy when only a few system files must be restored (for example, due to accidental file deletion).
- Application Software
This data changes whenever applications are installed, upgraded, or removed.
- Application Data
This data changes as frequently as the associated applications are run. Depending on the specific application and your organization, this could mean that changes take place second-by-second or once at the end of each fiscal year.
- User Data
This data changes according to the usage patterns of your user community. In most organizations, this means that changes take place all the time.
Based on these categories (and any additional ones that are specific to your organization), you should have a pretty good idea concerning the nature of the backups that are needed to protect your data.
You should keep in mind that most backup software deals with data on a directory or file system level. In other words, your system's directory structure plays a part in how backups will be performed. This is another reason why it is always a good idea to carefully consider the best directory structure for a new system and group files and directories according to their anticipated usage.
Backup Software: Buy Versus Build
In order to perform backups, it is first necessary to have the proper software. This software must not only be able to perform the basic task of making copies of bits onto backup media, it must also interface cleanly with your organization's personnel and business needs. Some of the features to consider when reviewing backup software include:
- Schedules backups to run at the proper time
- Manages the location, rotation, and usage of backup media
- Works with operators (and/or robotic media changers) to ensure that the proper media is available
- Assists operators in locating the media containing a specific backup of a given file
As you can see, a real-world backup solution entails much more than just scribbling bits onto your backup media.
Most system administrators at this point look at one of two solutions:
- Purchase a commercially-developed solution
- Create an in-house developed backup system from scratch (possibly integrating one or more open source technologies)
Each approach has its good and bad points. Given the complexity of the task, an in-house solution is not likely to handle some aspects (such as media management) very well, nor is it likely to have comprehensive documentation and technical support. However, for some organizations, this might not be a shortcoming.
A commercially-developed solution is more likely to be highly functional, but may also be overly-complex for the organization's present needs. That said, the complexity might make it possible to stick with one solution even as the organization grows.
As you can see, there is no clear-cut method for deciding on a backup system. The only guidance that can be offered is to ask you to consider these points:
- Changing backup software is difficult; once implemented, you will be using the backup software for a long time. After all, you will have long-term archive backups that you must be able to read. Changing backup software means that you must either keep the original software around (to access the archive backups), or convert your archive backups to be compatible with the new software.
Depending on the backup software, the effort involved in converting archive backups may be as straightforward (though time-consuming) as running the backups through an already-existing conversion program, or it may require reverse-engineering the backup format and writing custom software to perform the task.
- The software must be 100% reliable — it must back up what it is supposed to, when it is supposed to.
- When the time comes to restore any data — whether a single file or an entire file system — the backup software must be 100% reliable.
Types of Backups
If you were to ask people who are not familiar with computer backups, most would think that a backup is just an identical copy of all the data on the computer. In other words, if a backup was created Tuesday evening, and nothing changed on the computer all day Wednesday, the backup created Wednesday evening would be identical to the one created on Tuesday.
While it is possible to configure backups in this way, it is likely that you would not. To understand more about this, we first need to understand the different types of backups that can be created. They are:
- Full backups
- Incremental backups
- Differential backups
Full Backups
The type of backup that was discussed at the beginning of this section is known as a full backup. A full backup is a backup where every single file is written to the backup media. As noted above, if the data being backed up never changes, every full backup being created will be the same.
That similarity is due to the fact that a full backup does not check to see if a file has changed since the last backup; it blindly writes everything to the backup media whether it has been modified or not.
This is the reason why full backups are not done all the time — every file is written to the backup media. This means that a great deal of backup media is used even if nothing has changed. Backing up 100 gigabytes of data each night when maybe 10 megabytes worth of data has changed is not a sound approach; that is why incremental backups were created.
Incremental Backups
Unlike full backups, incremental backups first look to see whether a file's modification time is more recent than its last backup time. If it is not, the file has not been modified since the last backup and can be skipped this time. On the other hand, if the modification date is more recent than the last backup date, the file has been modified and should be backed up.
Incremental backups are used in conjunction with a regularly-occurring full backup (for example, a weekly full backup, with daily incrementals).
The primary advantage gained by using incremental backups is that the incremental backups run more quickly than full backups. The primary disadvantage to incremental backups is that restoring any given file may mean going through one or more incremental backups until the file is found. When restoring a complete file system, it is necessary to restore the last full backup and every subsequent incremental backup.
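The modification-time check described above can be sketched as follows. This is an illustration only: the files_to_back_up function is hypothetical, and real backup software may decide what has changed in other ways. The demonstration builds a temporary directory and sets the file times explicitly so that the result does not depend on when it is run.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterator


def files_to_back_up(root: Path, last_backup_time: float) -> Iterator[Path]:
    """Yield files under root modified more recently than the last backup.

    last_backup_time is in seconds since the epoch, the same unit as st_mtime.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.stat().st_mtime > last_backup_time:
                yield path  # modified since the last backup: include it
            # otherwise the file is unchanged and is skipped this time


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    unchanged = root / "unchanged.txt"
    changed = root / "changed.txt"
    unchanged.write_text("same as before")
    changed.write_text("edited today")

    last_backup = 1_000_000.0  # arbitrary reference time for the demonstration
    os.utime(unchanged, (last_backup - 100, last_backup - 100))
    os.utime(changed, (last_backup + 100, last_backup + 100))

    selected = sorted(p.name for p in files_to_back_up(root, last_backup))

print(selected)
```

Only the file modified after the last backup time is selected, which is why an incremental backup usually writes far less data than a full backup.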
In an attempt to alleviate the need to go through every incremental backup, a slightly different approach was implemented. This is known as the differential backup.
Differential Backups
Differential backups are similar to incremental backups in that both back up only modified files. However, differential backups are cumulative; in other words, with a differential backup, once a file has been modified it will continue to be included in all subsequent differential backups (until the next full backup, of course).
This means that each differential backup contains all the files modified since the last full backup, making it possible to perform a complete restoration with only the last full backup and the last differential backup.
Like the backup strategy used with incremental backups, differential backups normally follow the same approach: a single periodic full backup followed by more frequent differential backups.
The effect of using differential backups in this way is that the differential backups tend to grow a bit over time (assuming different files are modified over the time between full backups). This places differential backups somewhere between incremental backups and full backups in terms of backup media utilization and backup speed, while often providing faster single-file and complete restorations (due to fewer backups to search/restore).
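The difference in restoration effort between the two schemes can be made concrete with a short sketch. The restore_chain function and the backup labels below are hypothetical, and the sketch assumes that a schedule uses either incremental or differential backups between full backups, never a mixture of the two.

```python
from typing import List, Tuple

Backup = Tuple[str, str]  # (label, kind); kind is "full", "incremental", or "differential"


def restore_chain(history: List[Backup]) -> List[str]:
    """Return the backups needed, in order, for a complete restoration.

    history must be in chronological order, oldest backup first.
    """
    full_positions = [i for i, (_, kind) in enumerate(history) if kind == "full"]
    if not full_positions:
        raise ValueError("no full backup available")
    start = full_positions[-1]
    chain = [history[start][0]]
    later = history[start + 1:]
    kinds = {kind for _, kind in later}
    if kinds == {"incremental"}:
        # Every incremental since the last full backup is required.
        chain += [label for label, _ in later]
    elif kinds == {"differential"}:
        # Only the most recent differential is required.
        chain.append(later[-1][0])
    elif kinds:
        raise ValueError("mixed schedules are outside the scope of this sketch")
    return chain


incremental_week = [("Sun full", "full"), ("Mon", "incremental"),
                    ("Tue", "incremental"), ("Wed", "incremental")]
differential_week = [("Sun full", "full"), ("Mon", "differential"),
                     ("Tue", "differential"), ("Wed", "differential")]

print(restore_chain(incremental_week))
print(restore_chain(differential_week))
```

With the incremental schedule, a restoration on Wednesday night needs four backups; with the differential schedule, it needs two.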
Given these characteristics, differential backups are worth careful consideration.
Backup Media
We have been very careful to use the term "backup media" throughout the previous sections. There is a reason for that. Most experienced system administrators usually think about backups in terms of reading and writing tapes, but today there are other options.
At one time, tape devices were the only removable media devices that could reasonably be used for backup purposes. However, this has changed. In the following sections we look at the most popular backup media, and review their advantages as well as their disadvantages.
Tape
Tape was the first widely-used removable data storage medium. It has the benefits of low media cost and reasonably-good storage capacity. However, tape has some disadvantages — it is subject to wear, and data access on tape is sequential in nature.
These factors mean that it is necessary to keep track of tape usage (retiring tapes once they have reached the end of their useful life), and that searching for a specific file on tape can be a lengthy proposition.
On the other hand, tape is one of the most inexpensive mass storage media available, and it has a long history of reliability. This means that building a good-sized tape library need not consume a large part of your budget, and you can count on it being usable now and in the future.
Disk
In years past, disk drives would never have been used as a backup medium. However, storage prices have dropped to the point where, in some cases, using disk drives for backup storage does make sense.
The primary reason for using disk drives as a backup medium would be speed. There is no faster mass storage medium available. Speed can be a critical factor when your data center's backup window is short, and the amount of data to be backed up is large.
But disk storage is not the ideal backup medium, for a number of reasons:
- Disk drives are not normally removable. One key factor to an effective backup strategy is to get the backups out of your data center and into off-site storage of some sort. A backup of your production database sitting on a disk drive two feet away from the database itself is not a backup; it is a copy. And copies are not very useful should the data center and its contents (including your copies) be damaged or destroyed by some unfortunate set of circumstances.
- Disk drives are expensive (at least compared to other backup media). There may be situations where money truly is no object, but in all other circumstances, the expenses associated with using disk drives for backup mean that the number of backup copies must be kept low to keep the overall cost of backups low. Fewer backup copies mean less redundancy should a backup not be readable for some reason.
- Disk drives are fragile. Even if you spend the extra money for removable disk drives, their fragility can be a problem. If you drop a disk drive, you have lost your backup. It is possible to purchase specialized cases that can reduce (but not entirely eliminate) this hazard, but that makes an already-expensive proposition even more so.
- Disk drives are not archival media. Even assuming you are able to overcome all the other problems associated with performing backups onto disk drives, you should consider the following. Most organizations have various legal requirements for keeping records available for certain lengths of time. The chance of getting usable data from a 20-year-old tape is much greater than the chance of getting usable data from a 20-year-old disk drive. For instance, would you still have the hardware necessary to connect it to your system? Another thing to consider is that a disk drive is much more complex than a tape cartridge. When a 20-year-old motor spins a 20-year-old disk platter, causing 20-year-old read/write heads to fly over the platter surface, what are the chances that all these components will work flawlessly after sitting idle for 20 years?
Some data centers back up to disk drives and then, when the backups have been completed, the backups are written out to tape for archival purposes. This allows for the fastest possible backups during the backup window. Writing the backups to tape can then take place during the remainder of the business day; as long as the "taping" finishes before the next day's backups are done, time is not an issue.
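Whether a given medium fits your backup window is a matter of simple arithmetic. The sketch below uses hypothetical figures throughout: the data size, the throughput numbers for disk and tape, and the window length are invented for the example, and the function names are not taken from any real tool. Substitute measurements from your own environment.

```python
def backup_hours(data_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to write data_gb gigabytes at a sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    seconds = (data_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


def fits_window(data_gb: float, throughput_mb_per_s: float,
                window_hours: float) -> bool:
    """True if the backup can finish within the backup window."""
    return backup_hours(data_gb, throughput_mb_per_s) <= window_hours


# Hypothetical site: 450 GB to back up in a 4-hour window.
DATA_GB = 450
WINDOW_HOURS = 4
DISK_MB_PER_S = 64  # assumed sustained rate to disk
TAPE_MB_PER_S = 16  # assumed sustained rate to tape

disk_hours = backup_hours(DATA_GB, DISK_MB_PER_S)
tape_hours = backup_hours(DATA_GB, TAPE_MB_PER_S)
print(f"Disk: {disk_hours:.1f} h, fits window: "
      f"{fits_window(DATA_GB, DISK_MB_PER_S, WINDOW_HOURS)}")
print(f"Tape: {tape_hours:.1f} h, fits window: "
      f"{fits_window(DATA_GB, TAPE_MB_PER_S, WINDOW_HOURS)}")

# The copy to tape only has to finish before the next backup window opens.
hours_until_next_window = 24 - WINDOW_HOURS
print(f"Tape copy finishes in time: {tape_hours <= hours_until_next_window}")
```

With these figures, the backup to disk fits the window while a direct backup to tape does not, yet the later copy to tape finishes comfortably before the next day's backups begin.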
All this said, there are still some instances where backing up to disk drives might make sense. In the next section we see how they can be combined with a network to form a viable (if expensive) backup solution.
Network
By itself, a network cannot act as backup media. But combined with mass storage technologies, it can serve quite well. For instance, if you combine a high-speed network link with a remote data center containing large amounts of disk storage, the disadvantages of backing up to disks mentioned earlier are suddenly no longer disadvantages.
By backing up over the network, the disk drives are already off-site, so there is no need for transporting fragile disk drives anywhere. With sufficient network bandwidth, the speed advantage you can get from backing up to disk drives is maintained.
However, this approach still does nothing to address the matter of archival storage (though the same "spin off to tape after the backup" approach mentioned earlier can be used). In addition, the costs of a remote data center with a high-speed link to the main data center make this solution extremely expensive. But for the types of organizations that need the kind of features this solution can provide, it is a cost they will gladly pay.
Storage of Backups
Once the backups are complete, what happens then? The obvious answer is that the backups must be stored. However, what is not so obvious is exactly what should be stored — and where.
To answer these questions, we must first consider under what circumstances the backups are to be used. There are three main situations:
- Small, ad-hoc restoration requests from users
- Massive restorations to recover from a disaster
- Archival storage unlikely to ever be used again
Unfortunately, there are irreconcilable differences between the first two situations. When a user accidentally deletes a file, they would like it back immediately. This implies that the backup media is no more than a few steps away from the system to which the data is to be restored.
In the case of a disaster that necessitates a complete restoration of one or more computers in your data center, if the disaster was physical in nature, whatever it was that destroyed your computers would also have destroyed the backups sitting a few steps away from the computers. This would be a very bad state of affairs.
Archival storage is less controversial; since the chances that it will ever be used for any purpose are rather low, if the backup media was located miles away from the data center there would be no real problem.
The approaches taken to resolve these differences vary according to the needs of the organization involved. One possible approach is to store several days' worth of backups on-site; these backups are then taken to more secure off-site storage when newer daily backups are created.
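This first approach can be partly automated by listing the archives that have aged past the on-site retention period. The following is only a sketch: it assumes that backups are kept as *.tar.gz files in a single directory, and the function name and the /mnt/backup path are illustrative rather than part of any standard tool.

```shell
# list_offsite_candidates ONSITE_DIR DAYS
# Prints the archives in ONSITE_DIR that are due to move to off-site storage.
# find counts age in whole days, so "-mtime +3" matches files that were
# last modified at least four days ago.
list_offsite_candidates() {
    find "$1" -type f -name '*.tar.gz' -mtime +"$2" -print
}

# Example: list archives past a three-day on-site period (hypothetical path).
#   list_offsite_candidates /mnt/backup 3
```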
Another approach would be to maintain two different pools of media:
- A data center pool used strictly for ad-hoc restoration requests
- An off-site pool used for off-site storage and disaster recovery
Of course, having two pools implies the need to run all backups twice or to make a copy of the backups. This can be done, but double backups can take too long, and copying requires multiple backup drives to process the copies (and probably a dedicated system to actually perform the copy).
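Where the second pool is produced by copying, each copy should be checked against its original before it is trusted. The sketch below assumes, for simplicity, that both pools are directories on disk rather than tape media; the copy_to_pool function name and any paths used with it are hypothetical.

```shell
# copy_to_pool ARCHIVE POOL_DIR
# Copies ARCHIVE into POOL_DIR and succeeds only if the copy is identical.
copy_to_pool() {
    src=$1
    dest=$2/$(basename "$src")
    cp "$src" "$dest" || return 1
    # Reading from standard input makes cksum print only "checksum size",
    # so the two results can be compared directly.
    [ "$(cksum < "$src")" = "$(cksum < "$dest")" ]
}
```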
The challenge for a system administrator is to strike a balance that adequately meets everyone's needs, while ensuring that the backups are available for the worst of situations.
Restoration Issues
While backups are a daily occurrence, restorations are normally a less frequent event. However, restorations are inevitable; they will be necessary, so it is best to be prepared.
The important thing to do is to look at the various restoration scenarios detailed throughout this section and determine ways to test your ability to actually carry them out. And keep in mind that the hardest one to test is also the most critical one.
Restoring From Bare Metal
The phrase "restoring from bare metal" is a system administrator's way of describing the process of restoring a complete system backup onto a computer with absolutely no data of any kind on it — no operating system, no applications, nothing.
Overall, there are two basic approaches to bare metal restorations:
- Reinstall, followed by restore
Here the base operating system is installed just as if a brand-new computer were being initially set up. Once the operating system is in place and configured properly, the remaining disk drives can be partitioned and formatted, and all backups restored from backup media.
- Rescue disks
A rescue disk is bootable media of some kind (often a CD-ROM) that contains a minimal system environment, able to perform most basic system administration tasks. The rescue disk environment contains the necessary utilities to partition and format disk drives, the device drivers necessary to access the backup device, and the software necessary to restore data from the backup media.
Some computers have the ability to create bootable backup tapes and to actually boot from them to start the restoration process. However, this capability is not available to all computers. Most notably, computers based on the PC architecture do not lend themselves to this approach.
Testing Backups
Every type of backup should be tested on a periodic basis to make sure that data can be read from it. It is a fact that sometimes backups are performed that are, for one reason or another, unreadable. The unfortunate part in all this is that many times it is not realized until data has been lost and must be restored from backup.
The reasons for this can range from changes in tape drive head alignment to misconfigured backup software to operator error. No matter what the cause, without periodic testing you cannot be sure that you are actually generating backups from which data can be restored at some later time.
We are using the term data in this section to describe anything that is processed via backup software. This includes operating system software, application software, as well as actual data. No matter what it is, as far as backup software is concerned, it is all data.
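A periodic readability check of the kind described above can be sketched in a few lines of shell. This example assumes that the backups are gzip-compressed tar archives stored as files in one directory; the function names are invented for the illustration, and a check like this confirms only that an archive can be read, not that its contents are what was intended.

```shell
# check_archive ARCHIVE
# Exit status 0 means every member of the gzip-compressed tar archive could
# be read; any other status means the backup should be treated as suspect.
check_archive() {
    # Listing forces tar (and gzip) to read the archive from start to end.
    # The listing itself is discarded; only the exit status matters.
    tar tzf "$1" > /dev/null 2>&1
}

# check_all DIRECTORY
# Checks every *.tar.gz file in DIRECTORY, reports the ones that fail, and
# returns non-zero if any archive is unreadable.
check_all() {
    failed=0
    for archive in "$1"/*.tar.gz; do
        [ -e "$archive" ] || continue
        if ! check_archive "$archive"; then
            echo "UNREADABLE: $archive"
            failed=1
        fi
    done
    return "$failed"
}
```

A function such as check_all could be run on a schedule, with any output mailed to the system administrator for follow-up.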
Disaster Recovery
As a quick thought experiment, the next time you are in your data center, look around, and imagine for a moment that it is gone. And not just the computers. Imagine that the entire building no longer exists. Next, imagine that your job is to get as much of the work that was being done in the data center going in some fashion, somewhere, as soon as possible. What would you do?
By thinking about this, you have taken the first step of disaster recovery. Disaster recovery is the ability to recover from an event impacting the functioning of your organization's data center as quickly and completely as possible. The type of disaster may vary, but the end goal is always the same.
The steps involved in disaster recovery are numerous and wide-ranging. Here is a high-level overview of the process, along with key points to keep in mind.
Creating, Testing, and Implementing a Disaster Recovery Plan
A backup site is vital, but it is still useless without a disaster recovery plan. A disaster recovery plan dictates every facet of the disaster recovery process, including but not limited to:
- What events denote possible disasters
- What people in the organization have the authority to declare a disaster and thereby put the plan into effect
- The sequence of events necessary to prepare the backup site once a disaster has been declared
- The roles and responsibilities of all key personnel with respect to carrying out the plan
- An inventory of the necessary hardware and software required to restore production
- A schedule listing the personnel that will be staffing the backup site, including a rotation schedule to support ongoing operations without burning out the disaster team members
- The sequence of events necessary to move operations from the backup site to the restored/new data center
Disaster recovery plans often fill multiple looseleaf binders. This level of detail is vital because in the event of an emergency, the plan may well be the only thing left from your previous data center (other than the last off-site backups, of course) to help you rebuild and restore operations.
While disaster recovery plans should be readily available at your workplace, copies should also be stored off-site. This way, a disaster that destroys your workplace will not take every copy of the disaster recovery plan with it. A good place to store a copy is your off-site backup storage location. If it does not violate your organization's security policies, copies may also be kept in key team members' homes, ready for instant use.
Such an important document deserves serious thought (and possibly professional assistance to create).
And once such an important document is created, the knowledge it contains must be tested periodically. Testing a disaster recovery plan entails going through the actual steps of the plan: going to the backup site and setting up the temporary data center, running applications remotely, and resuming normal operations after the "disaster" is over. Most tests do not attempt to perform 100% of the tasks in the plan; instead a representative system and application is selected to be relocated to the backup site, put into production for a period of time, and returned to normal operation at the end of the test.
Although it is an overused phrase, a disaster recovery plan must be a living document; as the data center changes, the plan must be updated to reflect those changes. In many ways, an out-of-date disaster recovery plan can be worse than no plan at all, so make it a point to have regular (quarterly, for example) reviews and updates of the plan.
Backup Sites: Cold, Warm, and Hot
One of the most important aspects of disaster recovery is to have a location from which the recovery can take place. This location is known as a backup site. In the event of a disaster, a backup site is where your data center will be recreated, and where you will operate from, for the length of the disaster.
There are three different types of backup sites:
- Cold backup sites
- Warm backup sites
- Hot backup sites
Obviously these terms do not refer to the temperature of the backup site. Instead, they refer to the effort required to begin operations at the backup site in the event of a disaster.
A cold backup site is little more than an appropriately configured space in a building. Everything required to restore service to your users must be procured and delivered to the site before the process of recovery can begin. As you can imagine, the delay going from a cold backup site to full operation can be substantial.
Cold backup sites are the least expensive sites.
A warm backup site is already stocked with hardware representing a reasonable facsimile of that found in your data center. To restore service, the last backups from your off-site storage facility must be delivered, and bare metal restoration completed, before the real work of recovery can begin.
Hot backup sites have a virtual mirror image of your current data center, with all systems configured and waiting only for the last backups of your user data from your off-site storage facility. As you can imagine, a hot backup site can often be brought up to full production in no more than a few hours.
A hot backup site is the most expensive approach to disaster recovery.
Backup sites can come from three different sources:
- Companies specializing in providing disaster recovery services
- Other locations owned and operated by your organization
- A mutual agreement with another organization to share data center facilities in the event of a disaster
Each approach has its good and bad points. For example, contracting with a disaster recovery firm often gives you access to professionals skilled in guiding organizations through the process of creating, testing, and implementing a disaster recovery plan. As you might imagine, these services do not come without cost.
Using space in another facility owned and operated by your organization can be essentially a zero-cost option, but stocking the backup site and maintaining its readiness is still an expensive proposition.
Crafting an agreement to share data centers with another organization can be extremely inexpensive, but long-term operations under such conditions are usually not possible, as the host's data center must still maintain its normal production, making the situation strained at best.
In the end, the selection of a backup site is a compromise between cost and your organization's need for the continuation of production.
Hardware and Software Availability
Your disaster recovery plan must include methods of procuring the necessary hardware and software for operations at the backup site. A professionally-managed backup site may already have everything you need (or you may need to arrange the procurement and delivery of specialized materials the site does not have available); on the other hand, a cold backup site means that a reliable source for every single item must be identified. Often organizations work with manufacturers to craft agreements for the speedy delivery of hardware and/or software in the event of a disaster.
Availability of Backups
When a disaster is declared, it is necessary to notify your off-site storage facility for two reasons:
- To have the last backups brought to the backup site
- To arrange regular backup pickup and dropoff to the backup site (in support of normal backups at the backup site)
In the event of a disaster, the last backups you have from your old data center are vitally important. Consider having copies made before anything else is done, with the originals going back off-site as soon as possible.
Network Connectivity to the Backup Site
A data center is not of much use if it is totally disconnected from the rest of the organization that it serves. Depending on the disaster recovery plan and the nature of the disaster itself, your user community might be located miles away from the backup site. In these cases, good connectivity is vital to restoring production.
Another kind of connectivity to keep in mind is that of telephone connectivity. You must ensure that there are sufficient telephone lines available to handle all verbal communication with your users. What might have been a simple shout over a cubicle wall may now entail a long-distance telephone conversation; so plan on more telephone connectivity than might at first appear necessary.
Backup Site Staffing
The problem of staffing a backup site is multi-dimensional. One aspect of the problem is determining the staffing required to run the backup data center for as long as necessary. While a skeleton crew may be able to keep things going for a short period of time, as the disaster drags on more people will be required to maintain the effort needed to run under the extraordinary circumstances surrounding a disaster.
This includes ensuring that personnel have sufficient time off to unwind and possibly travel back to their homes. If the disaster was wide-ranging enough to affect people's homes and families, additional time must be allotted to allow them to manage their own disaster recovery. Temporary lodging near the backup site will be necessary, along with the transportation required to get people to and from the backup site and their lodgings.
Often a disaster recovery plan includes on-site representative staff from all parts of the organization's user community. This depends on the ability of your organization to operate with a remote data center. If user representatives must work at the backup site, similar accommodations must be made available for them, as well.
Moving Back Toward Normalcy
Eventually, all disasters end. The disaster recovery plan must address this phase as well. The new data center must be outfitted with all the necessary hardware and software; while this phase often does not have the time-critical nature of the preparations made when the disaster was initially declared, backup sites cost money every day they are in use, so economic concerns dictate that the switchover take place as quickly as possible.
The last backups from the backup site must be made and delivered to the new data center. After they are restored onto the new hardware, production can be switched over to the new data center.
At this point the backup data center can be decommissioned, with the disposition of all temporary hardware dictated by the final section of the plan. Finally, a review of the plan's effectiveness is held, with any changes recommended by the reviewing committee integrated into an updated version of the plan.
Red Hat Linux-Specific Information
Software Support
As a software vendor, Red Hat does have a number of support offerings for its products, including Red Hat Linux. You are using the most basic support tool right now by reading this manual. Documentation for Red Hat Linux is available on the Red Hat Linux Documentation CD (which can be installed on your Red Hat Linux system for fast access), in printed form in the various Red Hat Linux boxed products, and in electronic form at http://www.redhat.com/docs/.
Self-support options are available via the many mailing lists hosted by Red Hat (available at https://listman.redhat.com/mailman/listinfo/). These mailing lists take advantage of the combined knowledge of the Red Hat Linux user community; in addition, many lists are monitored by Red Hat personnel, who contribute as time permits. A knowledge base is also available from Red Hat's main support page at http://www.redhat.com/apps/support/.
More comprehensive support options exist; information on them can be found on the Red Hat website.
Backup Technologies
Red Hat Linux comes with several different programs for backing up and restoring data. By themselves, these utility programs do not constitute a complete backup solution. However, they can be used as the nucleus of such a solution.
As noted in Section 8.2.6.1 Restoring From Bare Metal, most computers based on the standard PC architecture do not possess the necessary functionality to boot directly from a backup tape. Consequently, Red Hat Linux is not capable of performing a tape boot when running on such hardware.
However, it is also possible to use your Red Hat Linux CD-ROM as a rescue disk; for more information see the chapter on rescue mode in the Red Hat Linux Customization Guide.
tar
The tar utility is well known among UNIX system administrators. It is the archiving method of choice for sharing ad-hoc bits of source code and files between systems. The tar implementation included with Red Hat Linux is GNU tar, one of the more feature-rich tar implementations.
Using tar, backing up the contents of a directory can be as simple as issuing a command similar to the following:
tar cf /mnt/backup/home-backup.tar /home/
This command creates an archive file called home-backup.tar in /mnt/backup/. The archive contains the contents of the /home/ directory.
The resulting archive file will be nearly as large as the data being backed up. Depending on the type of data being backed up, compressing the archive file can result in significant size reductions. The archive file can be compressed by adding a single option to the previous command:
tar czf /mnt/backup/home-backup.tar.gz /home/
The resulting home-backup.tar.gz archive file is now gzip compressed [1].
There are many other options to tar; to learn more about them, read the tar(1) man page.
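For completeness, the matching tar operations for inspecting and restoring such an archive are t (list) and x (extract). The sketch below wraps them in two small functions whose names are chosen for this example; it restores beneath a separate target directory, an assumption made here so that a trial restore cannot overwrite live files.

```shell
# list_backup ARCHIVE
# Prints the names of the files stored in a gzip-compressed tar archive.
list_backup() {
    tar tzf "$1"
}

# restore_backup ARCHIVE TARGET_DIR
# Unpacks the archive beneath TARGET_DIR instead of the live file system.
# GNU tar removes the leading / from member names when it creates an
# archive, so a backup of /home/ is restored as TARGET_DIR/home/.
restore_backup() {
    mkdir -p "$2" && tar xzf "$1" -C "$2"
}

# Example (hypothetical scratch directory):
#   restore_backup /mnt/backup/home-backup.tar.gz /tmp/restore-test
```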
cpio
The cpio utility is another traditional UNIX program. It is an excellent general-purpose program for moving data from one place to another and, as such, can serve well as a backup program.
The behavior of cpio is a bit different from tar. Unlike tar, cpio reads the names of the files it is to process via standard input. A common method of generating a list of files for cpio is to use programs such as find whose output is then piped to cpio:
find /home/ | cpio -o > /mnt/backup/home-backup.cpio
This command creates a cpio archive file called home-backup.cpio (containing everything in /home/) in the /mnt/backup/ directory.
Because find has a rich set of file selection tests, sophisticated backups can be created. For example, the following command performs a backup of only those files that have not been accessed within the past year:
find /home/ -atime +365 | cpio -o > /mnt/backup/home-backup.cpio
There are many other options to cpio (and find); to learn more about them read the cpio(1) and find(1) man pages.
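Restoring from a cpio archive uses copy-in mode (-i). The following sketch is one possible arrangement, not the only one: unlike the commands above, it runs find from inside the source directory so that the archive holds relative names and can be restored to any location. The function names are hypothetical.

```shell
# cpio_backup SOURCE_DIR ARCHIVE
# Archives SOURCE_DIR using relative names such as ./user/file.
# ARCHIVE should be located outside SOURCE_DIR.
cpio_backup() {
    (cd "$1" && find . -print | cpio -o) > "$2"
}

# cpio_list ARCHIVE
# Prints the names stored in the archive without extracting anything.
cpio_list() {
    cpio -it < "$1"
}

# cpio_restore ARCHIVE TARGET_DIR
# Extracts the archive beneath TARGET_DIR.
# -i extracts, -d creates directories as needed, -m keeps modification times.
cpio_restore() {
    mkdir -p "$2" && (cd "$2" && cpio -idm) < "$1"
}
```

Because the archive is opened before the change of directory takes effect, a relative archive path is interpreted from the caller's working directory.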
dump/restore: Not Recommended!
The dump and restore programs are Linux equivalents to the UNIX programs of the same name. As such, many system administrators with UNIX experience may feel that dump and restore are viable candidates for a good backup program under Red Hat Linux. Unfortunately, the design of the Linux kernel has moved ahead of dump's design. The use of dump/restore is strongly discouraged.
The Advanced Maryland Automatic Network Disk Archiver (AMANDA)
AMANDA is a client/server based backup application produced by the University of Maryland. By having a client/server architecture, a single backup server (normally a fairly powerful system with a great deal of free space on fast disks and configured with the desired backup device) can back up many client systems, which need nothing more than the AMANDA client software.
This approach to backups makes a great deal of sense, as it concentrates those resources needed for backups in one system, instead of requiring additional hardware for every system requiring backup services. AMANDA's design also serves to centralize the administration of backups, making the system administrator's life that much easier.
The AMANDA server manages a pool of backup media and rotates usage through the pool in order to ensure that all backups are retained for the administrator-dictated retention period. All media is pre-formatted with data that allows AMANDA to detect whether the proper media is available or not. In addition, AMANDA can be interfaced with robotic media changing units, making it possible to completely automate backups.
AMANDA can use either tar or dump to do the actual backups (although under Red Hat Linux using tar is preferable, due to the issues with dump raised in Section 8.4.2.3 dump/restore: Not Recommended!). As such, AMANDA backups do not require AMANDA in order to restore files — a decided plus.
In operation, AMANDA is normally scheduled to run once a day during the data center's backup window. The AMANDA server connects to the client systems and directs the clients to produce estimated sizes of the backups to be done. Once all the estimates are available, the server constructs a schedule, automatically determining the order in which systems are to be backed up.
Once the backups actually start, the data is sent over the network from the client to the server, where it is stored on a holding disk. Once a backup is complete, the server starts writing it out from the holding disk to the backup media. At the same time, other clients are sending their backups to the server for storage on the holding disk. This results in a continuous stream of data available for writing to the backup media. As backups are written to the backup media, they are deleted from the server's holding disk.
Once all backups have been completed, the system administrator is emailed a report outlining the status of the backups, making review easy and fast.
Should it be necessary to restore data, AMANDA contains a utility program that allows the operator to identify the file system, date, and file name(s). Once this is done, AMANDA identifies the correct backup media and then locates and restores the desired data. As stated earlier, AMANDA's design also makes it possible to restore data even without AMANDA's assistance, although identification of the correct media would be a slower, manual process.
This section has only touched upon the most basic AMANDA concepts. If you would like to do more research on AMANDA, you can start with the amanda(8) man page.
[1] The .gz extension is traditionally used to signify that the file has been compressed with gzip. Sometimes .tar.gz is shortened to .tgz to keep file names reasonably sized.