7.6.1 The problem

Some sites simply cannot afford any downtime. Even among those that can, once an application is running in production it is often difficult to schedule downtime for updates, whether these are application updates, appserver updates, operating system updates, hardware updates or any other kind of change to the system. Compounding this, emergency fixes (such as a patch for a newly discovered security flaw) must be applied as soon as possible for the sake of system integrity, and these often need to bypass the downtime scheduling process entirely.

Even when downtime is scheduled to perform an update, there is the issue of rollback. If, despite all of the thorough testing that happens before production, an issue is discovered in the production environment that requires the update to be rolled back, how exactly do you do this? Some customers perform a full backup after taking the system down and use it as their rollback solution; if something goes wrong, the system is taken down again and the backup is restored. This may work, but it involves further downtime, and because restoring the backup replaces the "failing" system, it also precludes any investigation of what went wrong.