 Chapter 1. Introduction to OpenStack High Availability

High Availability systems seek to minimize two things:

System downtime

Occurs when a user-facing service is unavailable beyond a specified maximum amount of time.

Data loss

Accidental deletion or destruction of data.

Most high availability systems guarantee protection against system downtime and data loss only in the event of a single failure. However, they are also expected to protect against cascading failures, where a single failure deteriorates into a series of consequential failures.

A crucial aspect of high availability is the elimination of single points of failure (SPOFs). A SPOF is an individual piece of equipment or software that causes system downtime or data loss if it fails. To eliminate SPOFs, check that mechanisms exist for redundancy of:

  • Network components, such as switches and routers

  • Applications and automatic service migration

  • Storage components

  • Facility services such as power, air conditioning, and fire protection
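As a rough illustration of why redundancy matters, the following sketch estimates the availability of a chain of components with and without redundant spares. It rests on a simplifying assumption that component failures are independent, which real deployments only approximate (cascading failures violate it). The function names and the availability figures are hypothetical, chosen only for the example.

```python
from math import prod  # available in Python 3.8 and later


def series_availability(availabilities):
    """Availability of components that must ALL work (each one is a SPOF)."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """Availability of redundant components where ANY one working is enough.

    Assumes failures are independent; correlated failures make the
    real figure lower than this estimate.
    """
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures: a switch, an application node, and a storage
# node, each available 99% of the time on its own.
single = [0.99, 0.99, 0.99]

# Without redundancy, every component is a single point of failure.
no_redundancy = series_availability(single)

# With each component duplicated, the chain is built from redundant pairs.
with_redundancy = series_availability(
    redundant_availability([a, a]) for a in single
)

print(f"no redundancy:   {no_redundancy:.4%}")
print(f"with redundancy: {with_redundancy:.4%}")
```

Under these assumptions, duplicating each component raises the estimated availability of the whole chain well above that of any single component, whereas the non-redundant chain is less available than its weakest part.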

If a component fails and a back-up system must take on its load, most high availability systems replace the failed component as quickly as possible to restore the necessary redundancy. This minimizes the time the system spends in a degraded protection state.

Most high availability systems will fail in the event of multiple independent (non-consequential) failures. In this case, most systems prioritize protecting data over maintaining availability.

High availability systems typically achieve an uptime percentage of 99.99% or more, which equates to less than an hour (about 53 minutes) of cumulative downtime per year. To achieve this, high availability systems should keep recovery times after a failure to about one to two minutes, sometimes significantly less.
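The arithmetic behind these figures can be sketched as follows. This is a minimal illustration, not part of any OpenStack tooling: the function names are hypothetical, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def downtime_budget_minutes(availability_percent):
    """Maximum cumulative downtime per year, in minutes, for a given uptime."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def failures_within_budget(availability_percent, recovery_minutes):
    """How many failures per year fit in the budget at a given recovery time."""
    return int(downtime_budget_minutes(availability_percent) // recovery_minutes)


budget = downtime_budget_minutes(99.99)
print(f"99.99% uptime allows about {budget:.1f} minutes of downtime per year")

for recovery in (1, 2):
    count = failures_within_budget(99.99, recovery)
    print(f"{recovery}-minute recovery: at most {count} failures per year")
```

The calculation shows why short recovery times matter: with a budget of under an hour per year, a recovery time of one to two minutes leaves room for only a few dozen failures, while a single recovery that takes an hour would exhaust the whole budget.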

OpenStack currently meets such availability requirements for its own infrastructure services, meaning that an uptime of 99.99% is feasible for the OpenStack infrastructure proper. However, OpenStack does not guarantee 99.99% availability for individual guest instances.

Preventing single points of failure can depend on whether or not a service is stateless.
