How a Multi-Cloud Strategy Protects Against Failure: When Clouds Crash, What’s Your Plan?

Today, many websites and mobile apps rely on Amazon S3 and other cloud-based services. Clouds have proven convenient and cost-effective. But to guard against the inevitable failures, Cloud Foundry Foundation CTO Chip Childers suggests a multi-cloud approach.


Chip Childers, Cloud Foundry Foundation



When one region of Amazon’s Simple Storage Service (S3) went offline earlier this year, the Internet reeled.


No wonder. Object storage services like S3 have become an essential component in many application architectures, and today many thousands of websites and mobile applications rely on S3 to store and serve up content. The impact of the outage was therefore not limited to apps that use S3 merely as a data store.


In this case, the Amazon S3 outage began with an attempt to debug a slow-performing S3 billing subsystem: a command intended to remove a few servers took down far more capacity than intended, forcing a full restart. Four hours later, S3 was back in business.


This recent cloud failure, although it lasted only a few hours, is instructive once you take the time to understand the inner workings of clouds, and how one minor failure can trigger subsequent ones.


For example, Amazon Web Services frequently builds its higher-level services on top of its own lower-level ones. So the S3 failure cascaded into other AWS services that rely on S3 in that region, including the S3 console, new EC2 instance launches, and AWS Lambda, Amazon’s serverless computing option. Even the service designed to provide status information about the outage depended on S3 in that region.


Add it all up and the cascading impacts became a major landslide of disruption. All told, nearly a fifth of the Internet was affected, costing the companies hit hundreds of millions of dollars. It’s worth noting that while Amazon has 16 regions, each with multiple data centers, US-East-1 is Amazon’s oldest and least expensive region, and it carries more traffic than the others.


Here’s the thing, though: cloud operations is really hard, and operating a service at the scale of AWS is even harder. The teams at Amazon Web Services responsible for building, maintaining and operating their services at ever-growing levels of scale are, by any objective measure, doing an amazing job. That includes the excellent transparency they offered the public on the root cause and the remediation steps they will take to improve for the future.


The key take-away from this AWS outage isn’t simply that Amazon had an outage. It’s that so many companies reliant on Amazon services for business-critical operations didn’t plan for this eventuality. It’s happened before. It will happen again.

There’s really no excuse for being unprepared.


Cloud Services Will Fail – Plan for It

Long before the cloud existed, IT departments designed their enterprises with systems availability in mind. Even back in the day of mainframes, companies popped up to offer disaster recovery. As technology progressed to midrange systems and Intel-based platforms in data centers, disaster recovery evolved to encompass hardware-based availability solutions, as well as clustering technology to make sure databases were spread out across multiple physical devices. We used to spend quite a lot of time thinking about the availability and recoverability of our IT systems.


Then came the cloud.


One of the great benefits of public cloud providers is the enormous amount of engineering time and effort they have put into creating resilient systems. They tout economies of scale, and that scale yields operational experience well beyond what most enterprises have or can hire internally. Quite frankly, very few organizations in the world operate at the scale of Amazon Web Services, let alone achieve the level of availability and resiliency that AWS has achieved.


However, outsourcing significant operational elements to cloud providers, without having to deal with the implementation details (whether hardware, or a storage service you access via an API, like S3), doesn’t eliminate your ultimate responsibility for the availability of your business’ systems.


As with hardware failures, cloud services will (and do) fail.


Why a Multi-Cloud Strategy Makes Sense – Steps To Get Started

One thing that makes cloud native application principles so compelling is the concept of designing the architecture to assume failure of the underlying cloud services it relies on. Most discussions of “design for failure” focus on things like EC2 instance availability, and using multiple availability zones when deploying your application. What’s important to note about the S3 failure is that it was a bit different. First, S3 is all about highly durable data storage, not about running a user’s application code. AWS users have grown accustomed to assuming that the level of durability offered by S3 equates to availability. Second, typical information on building resilient applications on AWS focuses on the use of availability zones to ensure fault isolation, but the outage itself was across an entire region.


This is where a multi-cloud strategy can be very useful. Designing your app for a multi-cloud environment means relying on multiple cloud providers, or at least using multiple regions of a single cloud provider. Such a strategy might involve using storage services from Google and Amazon, or Microsoft and Amazon, or multiple AWS regions. The risk of multiple systems being offline at the same time decreases the further apart they are, whether geographically separated or operated by different corporations. Working with multiple cloud providers is a strong availability strategy.
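As a minimal sketch of what this looks like in code (all class and backend names below are hypothetical illustrations, not any real provider SDK), the first step is to hide each provider behind a common storage interface, so application code never binds directly to a single vendor:

```python
from abc import ABC, abstractmethod


class ObjectStore(ABC):
    """Provider-neutral interface for an S3-like object store."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class InMemoryStore(ObjectStore):
    """Stand-in for a real provider client (e.g. a wrapper around an AWS
    or Google Cloud SDK); here it just keeps objects in a dict."""

    def __init__(self, location: str):
        self.location = location            # region or provider label
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


# Two independent backends: different regions, or different providers entirely.
primary = InMemoryStore("aws:us-east-1")
secondary = InMemoryStore("gcp:europe-west1")
```

With this seam in place, swapping a backend or adding a second one is a configuration change rather than an application rewrite.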


Focusing on the S3 example, application architects need to consider how to handle both write and read operations in a multi-cloud way. Write operations are the first item to focus on.


Generally, there are two strategies that can be employed (or combined): real-time mirroring of write operations to multiple storage services and eventual consistency techniques that replicate the data to the secondary service. Depending on your needs, one or both will be appropriate. As for read operations, think about access to the cloud storage services just like you would other microservices that your application depends on. The circuit breaker pattern can be easily employed to fail over read operations from the primary to secondary cloud storage service.
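A hedged sketch of both ideas in Python (the classes are illustrative stand-ins, not a real SDK): `MirroredStore` writes synchronously to two backends, and its reads fail over through a simple circuit breaker that skips the primary after repeated failures:

```python
class DictStore:
    """Tiny in-memory stand-in for one cloud storage backend."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


class MirroredStore:
    """Mirrors writes to two backends; reads fail over via a circuit breaker.

    After `threshold` consecutive primary failures the breaker opens and
    reads skip the primary entirely until reset() is called.
    """

    def __init__(self, primary, secondary, threshold=3):
        self.primary = primary
        self.secondary = secondary
        self.threshold = threshold
        self.failures = 0

    def put(self, key, data):
        # Strategy 1: real-time mirroring of every write to both services.
        self.primary.put(key, data)
        self.secondary.put(key, data)

    def get(self, key):
        if self.failures < self.threshold:      # breaker still closed
            try:
                data = self.primary.get(key)
                self.failures = 0               # a success resets the counter
                return data
            except Exception:
                self.failures += 1              # count the failure, fall through
        return self.secondary.get(key)          # fail over to the secondary

    def reset(self):
        """Close the breaker once the primary is known healthy again."""
        self.failures = 0
```

A production version would add timeouts, retries and a half-open probing state, but the core failover logic is no more complicated than this.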


There’s quite a bit more detail involved, but the simplified description is:

(1) get your data in at least two storage services and
(2) build your application that works with that data to automatically respond to an outage in one (or more) of the storage services.
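For the eventual-consistency variant of step (1), a background worker can replay each write to the secondary store after the primary acknowledges it. Again, a simplified, hypothetical sketch rather than production code:

```python
import queue
import threading


class DictStore:
    """Tiny in-memory stand-in for one cloud storage backend."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data):
        self._objects[key] = data

    def get(self, key):
        return self._objects[key]


class AsyncReplicator:
    """Eventually-consistent replication: a write returns once the primary
    accepts it; a background thread replays it to the secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self._queue = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def put(self, key, data):
        self.primary.put(key, data)     # fast path: one synchronous write
        self._queue.put((key, data))    # replicate in the background

    def _drain(self):
        while True:
            key, data = self._queue.get()
            self.secondary.put(key, data)
            self._queue.task_done()

    def flush(self):
        self._queue.join()              # block until the secondary catches up
```

The trade-off versus synchronous mirroring is lower write latency in exchange for a replication lag window during which the secondary may be stale.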

Designing for failure isn’t cheap. There’s always a cost to consuming more storage or more cloud services. You have to weigh the cost of extra cloud consumption against the potential loss of business if your app goes offline. To avoid business-damaging downtime, plan for how your apps will handle cloud service failure. Whether you spread your apps across multiple availability zones and/or cloud providers, having a multi-cloud strategy can help you adapt quickly when (not if) failure happens.


As CTO at the Cloud Foundry Foundation, Chip Childers drives technology initiatives to make Cloud Foundry the leading open source application platform for enterprise-class cloud computing. He has spent 18+ years in large-scale computing and open source software, including as first vice president of Apache CloudStack. Chip is a sought-after expert speaker on cloud and open source, having presented at OSCON, ApacheCon and the O’Reilly Software Architecture Conference.