6.1.3 Failover

Running multiple servers (potentially on multiple independent machines) naturally makes it possible for the system to provide failover. That is, if any one machine or server in the system fails for any reason, the system should continue to operate with the remaining servers. Load balancing should ensure that the client load is redistributed to the remaining servers, each of which takes on a proportionally larger share of the total load. Of course, such an arrangement assumes that the system is designed with some degree of overcapacity, so that the remaining servers are sufficient to process the total expected client load.
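The overcapacity requirement can be checked with simple arithmetic. The following is a minimal sketch, assuming perfectly even load balancing and identical servers; the function names and the figures in the usage note are hypothetical and are not taken from any WebSphere API.

```python
def per_server_load(total_load: float, servers_up: int) -> float:
    """Load carried by each surviving server, assuming even balancing."""
    if servers_up <= 0:
        raise ValueError("no servers left to carry the load")
    return total_load / servers_up


def survives_failures(total_load: float, servers: int,
                      capacity_per_server: float, failures: int) -> bool:
    """Return True if the remaining servers can absorb the full load."""
    remaining = servers - failures
    if remaining <= 0:
        return False  # every server has failed; nothing can carry the load
    return per_server_load(total_load, remaining) <= capacity_per_server
```

For instance, with four servers that can each handle 100 requests per second and a total load of 300 requests per second, each server normally carries 75. After one failure the three survivors carry 100 each, which is exactly at capacity; after two failures each survivor would need to carry 150, so the system no longer has enough capacity.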

Ideally, failover should be totally transparent to clients of the system. When a server fails, any client that is currently interacting with that server should be automatically redirected to one of the remaining servers, without any interruption of service and without requiring any special action on the part of that client. In practice, however, most failover solutions might not be completely transparent. For example, a client that is in the middle of an operation when a server fails might receive an error from that operation, and might be required to retry (at which point the client would be connected to another, still available server). Or the client might observe a pause or delay in processing before the processing of its requests resumes automatically with a different server. The important point in failover is that each client, and the set of clients as a whole, is eventually able to continue to take advantage of the system and receive service, even if some of the servers fail and become unavailable. Conversely, when a previously failed server becomes available again, the system might transparently start using that server again to process a portion of the total client load.
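The non-transparent case, in which an operation fails and must be retried against another server, can be sketched as follows. This is only an illustration under stated assumptions: each server is modeled as a plain callable, and ServerUnavailable, call_with_failover, and the two stub servers are hypothetical names, not part of any real product API.

```python
class ServerUnavailable(Exception):
    """Raised by a hypothetical server stub when that server is down."""


def call_with_failover(servers, request):
    """Try the request on each server in turn until one succeeds."""
    last_error = None
    for server in servers:
        try:
            return server(request)
        except ServerUnavailable as err:
            last_error = err  # this server failed; move on to the next one
    # Every server was tried and none could handle the request.
    raise ServerUnavailable("all servers failed") from last_error


def down(request):
    """Stub for a failed server: always raises."""
    raise ServerUnavailable("server is down")


def up(request):
    """Stub for a healthy server: always answers."""
    return f"handled {request}"
```

Calling call_with_failover([down, up], "order-42") returns the reply from the second server, so the caller sees only a delay and not an error. If every server in the list is down, the error is finally surfaced to the caller.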

Failover is also sometimes called fault tolerance, in that it allows the system to survive a variety of failures or faults. It should be noted, however, that failover is only one technique in the much broader field of fault tolerance, and that no such technique can make a system 100% safe against every possible failure. The goal is to minimize the probability of system failure, but keep in mind that the possibility of system failure cannot be completely eliminated.
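A rough calculation shows why redundancy reduces the probability of total failure without eliminating it. The sketch below assumes that servers fail independently and with the same probability, which is a simplification: real failures are often correlated (shared power, network, or software defects). The function name is hypothetical.

```python
def prob_all_down(p_single: float, servers: int) -> float:
    """Probability that every server is down at once, assuming independence."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if servers < 1:
        raise ValueError("at least one server is required")
    return p_single ** servers
```

Under these assumptions, adding servers shrinks the result quickly, but for any nonzero single-server failure probability it never reaches zero.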

Note that in the context of discussions on failover, the term server often refers to a physical machine. However, WebSphere vertical scaling also allows for one server process on a given machine to fail independently, while other processes on that same machine continue to operate normally.