High availability is the expectation that a system will operate continuously for a significant span of time. For example, with 8,760 hours in a year, 99% availability allows roughly 88 hours of downtime over the course of that year, or just over 7 hours a month. In turn, 99.9% availability (three nines) adds up to over eight hours of unplanned downtime a year, while 99.99% (four nines) translates into under an hour.
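The arithmetic behind these figures is simple enough to sketch in a few lines of Python. The function name and the 8,760-hour year (which ignores leap years) are illustrative assumptions, not part of any standard tool.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Return the yearly downtime, in hours, that an availability target allows."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


if __name__ == "__main__":
    # Print the downtime budget for two through five nines.
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = downtime_hours_per_year(target)
        print(f"{target}% -> {hours:.2f} hours/year ({hours / 12:.2f} hours/month)")
```

Each extra nine divides the downtime budget by ten, which is why the gap between 99% and 99.99% is the gap between days and minutes.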
When it comes to availability, many companies focus on the possibility of single-node failures. A common workaround for such failures is to run an active/active system, a network of independent processing nodes where each node has access to a replicated database so all nodes participate in a common application. Another workaround is an active/passive high-availability cluster, where the second “standby” node is used if the first node fails.
The availability strategy selected typically depends on the layer of the stack in question. It’s fairly easy to run the same content on two web servers as active/active, whereas an active/passive cluster is more commonly used with databases since managing multiple active database masters can be a challenge. Regardless, to achieve high availability a system should accommodate failure through the right amount of redundancy.
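To see why redundancy helps, here is a rough sketch of the standard formula for redundant nodes. It rests on two simplifying assumptions that real systems rarely meet in full: nodes fail independently of one another, and a single healthy node is enough to serve traffic with instant failover. The function name is hypothetical.

```python
def combined_availability(node_availability: float, nodes: int) -> float:
    """Availability of a group of redundant nodes.

    Assumes independent failures and that one healthy node can carry the
    full load. The group is down only when every node is down at once.
    """
    if not 0.0 <= node_availability <= 1.0:
        raise ValueError("node_availability must be a fraction between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    probability_all_down = (1.0 - node_availability) ** nodes
    return 1.0 - probability_all_down
```

Under these assumptions, two nodes that are each 99% available give `combined_availability(0.99, 2)`, or about 99.99%. Correlated failures (a shared power feed, a bad deploy) and slow failover will pull the real number below that estimate.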
In short, if your system isn’t functioning, you’re not making money. The goal, of course, is to stay up and running. That seems straightforward enough, but the reality is much more complex.
What Can Go Wrong? Everything.
For starters, it’s critical to identify all parts of your system—a single machine, one data center, one network in one location, a single cloud provider—that can fail and, as previously mentioned, put the right redundancies in place.
Single-machine failures are typically inexpensive to protect against and quick to recover from. To increase availability further, you can deploy to multiple Availability Zones, each of which groups servers in a distinct location. Launching instances in separate Availability Zones can protect applications from single-location failure. Above Availability Zones are regions, which place data centers in different geographic areas, such as the east coast and the west coast. With multiple regions, a failure on one coast usually doesn’t impact overall availability.
The strategy for a mid-sized company might look something like this:
- Use Amazon Relational Database Service to manage database availability.
- Replicate the database into multiple Availability Zones to increase uptime.
- Deploy applications into multiple Availability Zones.
- Ensure all applications are deployed to at least two nodes.
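The last two rules in this list are easy to check mechanically. The sketch below assumes a hypothetical inventory of nodes, each tagged with an application name and a zone. It does not call any cloud provider API, and the `Node` type and `redundancy_gaps` function are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    app: str   # application running on this node
    zone: str  # Availability Zone the node lives in


def redundancy_gaps(nodes, min_nodes=2, min_zones=2):
    """Return {app: [problems]} for apps that miss the redundancy rules."""
    zones_by_app = {}
    for node in nodes:
        zones_by_app.setdefault(node.app, []).append(node.zone)

    gaps = {}
    for app, zones in zones_by_app.items():
        problems = []
        if len(zones) < min_nodes:
            problems.append(f"only {len(zones)} node(s)")
        if len(set(zones)) < min_zones:
            problems.append(f"only {len(set(zones))} zone(s)")
        if problems:
            gaps[app] = problems
    return gaps
```

An application with two nodes in the same zone still shows up as a gap, because a single-location failure would take out both nodes at once.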
How Many Nines Do You Need?
That’s the real question. The answer depends on your requirements. While some organizations set their sights on 100% availability, most systems don’t need to hit such heights. Nuclear reactors, missile defense systems, and stock exchanges have a high cost of failure and need high reliability, but web applications may not need as much.
Most smaller businesses don’t need to invest in high levels of fault tolerance, which can require immense hardware and engineering resources. Every time you add a nine, the allowed downtime shrinks by a factor of ten, and costs tend to rise steeply.
Two important questions:
- What’s the dollar value per hour of downtime, and how does that cost compare to the cost of offsetting the problem?
- How much availability does your system realistically require? Do you need five nines? Three nines? Will 99% suffice?
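The first question can be turned into a back-of-the-envelope comparison. The sketch below assumes downtime cost scales linearly with hours of outage and that the upgrade actually delivers the target availability. All names and dollar figures are placeholders to be replaced with your own estimates.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def expected_downtime_cost(availability_percent: float, cost_per_hour: float) -> float:
    """Yearly cost of downtime at a given availability, assuming linear cost."""
    downtime_hours = HOURS_PER_YEAR * (100 - availability_percent) / 100
    return downtime_hours * cost_per_hour


def upgrade_pays_off(current_percent: float, target_percent: float,
                     cost_per_hour: float, upgrade_cost_per_year: float) -> bool:
    """True if the downtime savings exceed the yearly cost of the upgrade."""
    savings = (expected_downtime_cost(current_percent, cost_per_hour)
               - expected_downtime_cost(target_percent, cost_per_hour))
    return savings > upgrade_cost_per_year
```

With downtime valued at a hypothetical $1,000 per hour, moving from 99% to 99.9% saves about $78,840 a year, so a $50,000 upgrade pays for itself. Moving from 99.9% to 99.99% saves only about $7,884, so the same spend does not.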
Start Small, Grow Smart
Companies usually start with a few nodes in one Availability Zone and then grow to two or three Availability Zones. As the cost of downtime increases, companies start to look at more expensive options to increase availability, such as multiple regions and multiple cloud providers.
Only through risk analysis can you make smart investment decisions that balance costs with your business needs and your risk tolerance. The key is to evaluate the risk of failure and the associated costs, and then determine the availability that your business and your clients demand.