The Day the Cloud Died: Amazon’s Outage
Sometime around 1:41 AM PDT, Amazon began to notice high error rates in its Elastic Block Store (EBS) service and trouble reaching Elastic Compute Cloud (EC2) instances at its Virginia datacenter.
Unfortunately for Amazon and the companies that rely upon it, the problem was not quickly resolved, and many woke up the next morning to find that some of their favorite services and tools were not working. Sites like Reddit, Quora, HootSuite and Foursquare were all knocked either partially or totally offline.
However, there were also thousands of other companies that suffered, including Waze, a social GPS navigation tool that had its navigation system taken offline for the morning, and Heroku, a popular Ruby on Rails hosting platform.
Companies big and small were impacted. However, according to reports, one of Amazon’s most prominent customers, Netflix, was able to escape any serious problems as it is built to survive the loss of a datacenter without any downtime.
Likewise, customers not based out of the Virginia datacenter were not affected, and neither were customers of other Amazon solutions such as S3 or CloudFront. However, the loss of EC2 at that datacenter seemed to be a surgical strike, taking out a lot of startups and popular websites that get a great deal of media attention.
Unfortunately, even now, over 24 hours after the problem was detected, it doesn’t appear to be completely fixed. Though many of the sites and services have recovered, slow performance and outages still plague others. Even right now, Amazon does not have an estimated time for a complete fix.
In short, the outage is, for some, still continuing, and the whole debacle could spell disaster both for the concept of cloud hosting and for efforts to encourage people to rely on the Internet more in their daily lives.
What We Know About Amazon’s Outage
Amazon is being tight-lipped about the cause of the outage and is primarily communicating directly with its customers rather than making public statements. However, by looking at what happened, a few facts have been pieced together.
First, the problem was clearly not as isolated as it should have been and, for whatever reason, Amazon was unable to use its own safeguards against these types of problems.
Amazon Web Services has two different concepts when it comes to availability. The first is Regions, which include U.S. East (the one that failed), U.S. West, Europe and two in Asia. The second is Availability Zones (AZs): each region contains multiple AZs, which physically separate instances of Amazon’s cloud within that region. The idea is that, if one AZ goes down or is destroyed, another picks up seamlessly.
Unfortunately, that didn’t happen in this case.
The failure at the U.S. East region spread across multiple AZs, which should not have been able to happen. Many companies built their entire infrastructure, perhaps foolishly, around the idea that a single AWS region cannot go down like this; this is why so many services, especially startups, were so deeply impacted.
Worse, the event took place in the U.S. East region. As the cheapest of the regions, the first to get new features and the one Amazon places new customers in by default, U.S. East is by far the most popular among startups, not least because it’s closest to the bulk of the U.S. population. So, even though only one region out of five was affected, it was the one that the most U.S.-based startups and companies depend on.
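To make the single-region dependency concrete, here is a minimal sketch of a failover lookup that tries every zone in a preferred region before falling back to a second region. Everything in it is an assumption made for illustration: the zone names, the `DEPLOYMENT` map, and the `pick_zone` helper are hypothetical, and the health check is simulated rather than a real AWS call.

```python
from typing import Callable, Optional, Sequence

# Hypothetical deployment map: region name -> Availability Zones used there.
# The names are illustrative placeholders, not a real AWS inventory.
DEPLOYMENT = {
    "us-east": ["us-east-a", "us-east-b"],
    "us-west": ["us-west-a"],
}


def pick_zone(
    deployment: dict[str, Sequence[str]],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy zone, trying regions in listed order.

    Zones in the preferred region are tried first; later regions are
    only used when every zone in the earlier ones fails the check.
    """
    for zones in deployment.values():
        for zone in zones:
            if is_healthy(zone):
                return zone
    return None  # nothing healthy anywhere: a full outage


# Simulate the scenario described above: every zone in us-east is down.
down = {"us-east-a", "us-east-b"}
chosen = pick_zone(DEPLOYMENT, lambda zone: zone not in down)
print(chosen)  # us-west-a
```

A service whose map contains only one region has nowhere to go once all of that region's zones fail the check, which is the position many of the affected sites appear to have been in.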
The result is that the outage was, and still is in part, a perfect storm across multiple AZs in the most popular region. Though the entire Cloud may not be down, even in the Amazon sense of the word, it can certainly feel like it and the media has jumped on the story.
That, in turn, may be the biggest problem of all.
As of this writing, Google News is reporting some 1,055 stories for the search term “Amazon Cloud Outage,” and that only covers the mainstream media. Many times that number of blogs and other sites have written about it as well.
The hardest part in convincing others to trust the cloud with their data, their services, etc. is reliability. The Internet is simply not treated like a utility and most people don’t rely on it to be there when they need it.
So when trying to convince people to move something of value to them to the cloud, as Amazon did with its Amazon Cloud Drive and Cloud Player, it can be a tough sell.
Though Amazon, like other cloud services, has had great reliability over the years, yesterday’s lengthy outage was a very public black mark on its reputation and the reputation of cloud services everywhere.
After all, Amazon is generally thought of as the “best” and “safest” cloud service, at least in terms of perception, so other companies will likely suffer as much as Amazon from this outage.
Amazon Web Services probably won’t go out of business because of this, but smaller companies might. Not because they were at fault, but because the whole idea of trusting the cloud will be an even tougher sell.
For the cloud hosting industry and cloud-based services to fully recover from this, there are three things that need to happen and quickly:
- An Amazon Alternative: A company needs to rise up and become a viable alternative to Amazon in every regard. Whether it’s a giant like Google or a startup, someone needs to challenge Amazon in terms of size and reputation.
- Compelling New Reasons: People, both companies and individuals, are going to be timid about relying on the cloud, so cloud-based services, like Amazon, are going to have to give them compelling new reasons to do so.
- A Period of Stability: People have short memories for the most part and, if cloud-based computing can have a long period of relative stability, then most will forget about this issue pretty quickly, especially as the uptime percentages tick closer to what one would consider a “normal” range.
In the end, what took place was far more than Amazon giving itself a black eye or a few sites going down. An entire, budding industry has taken a huge hit. It’s an industry that holds a great deal of promise to change the way we use the Web and computers, but it’s also an industry based heavily on trust, something that’s greatly damaged right now.
Though cloud hosting and computing can recover, it’s going to take a long time to do so. The damage done by more than 24 hours of downtime cannot be undone in a week, a month or even a year, not when you’ve built your entire promise on high availability.
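For a rough sense of scale, the arithmetic behind that claim can be sketched as follows. It treats exactly 24 hours as a lower bound on the downtime and measures it against a single 365-day year; the 99.95% target is an illustrative figure chosen for comparison, and both helper functions are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which the service was up."""
    return 1 - downtime_hours / period_hours


def allowed_downtime_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget, in hours, for a given availability target."""
    return (1 - target) * period_hours


# 24 hours of downtime measured over one year.
yearly = availability(24)
print(f"{yearly:.3%}")  # 99.726%

# Downtime budget for an illustrative 99.95% yearly target.
budget = allowed_downtime_hours(0.9995)
print(f"{budget:.2f} hours")  # 4.38 hours
```

Under these assumptions, a single day of downtime uses up more than five years' worth of a 4.38-hour annual budget.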
It’s going to be a long road to get back to where the industry was Wednesday night, but it can be done with a lot of work and a little patience.