Amazon cloud outage was triggered by configuration error
Company's postmortem and apology win praise for transparency
Computerworld - Amazon has released a detailed postmortem and mea culpa about the partial outage of its cloud services platform last week and identified the culprit: A configuration error made during a network upgrade.
During this configuration change, a traffic shift "was executed incorrectly," Amazon said, noting that traffic that should have gone to a primary network was routed to a lower-capacity one instead. The error occurred at 12:47 p.m. on April 21 and led to a partial outage that lingered through last weekend.
The outage sent a number of prominent Web sites offline, including Quora, Foursquare and Reddit, and renewed an industry-wide debate over the maturity of cloud services.
Amazon posted short, bulletin-like updates throughout the outage, but what it offered in its postmortem is entirely different. The nearly 5,700-word document includes a detailed look at what happened, an apology, a credit to affected customers, as well as a commitment to improve its customer communications.
Amazon didn't say explicitly whether it was human error that touched off the event, but it hinted at that possibility when it wrote that "we will audit our change process and increase the automation to prevent this mistake from happening in the future."
The initial mistake, followed by the subsequent increase in network load, exposed a cascading series of issues, including a "re-mirroring storm" with systems continuously searching for a storage space.
Amazon also said in its explanation of the outage that it will work to ensure that it builds software and services that can survive failures.
Matt Stevens, the CTO of AppNeta, a cloud network performance management company and an Amazon cloud user, praised Amazon's postmortem for its transparency. "As a technical architect, I thought it was actually amazing how deep they went into it," said Stevens, adding that he wished the company had offered more detail about the initial network change that started the problem.
In terms of the overall issue, Stevens said: "How does anybody who runs their own private data center know how it's going to hold up until you have a massive issue?"
Jim Damoulakis, CTO of GlassHouse Technologies, an enterprise storage services provider, called it "a pretty thorough postmortem and I think for the most part they are being transparent about it."
Damoulakis said that while Amazon will take steps to keep the problem from happening again -- and to make its availability zones more robust -- customers will ultimately be responsible for having a good disaster recovery plan.
"I think there is blame on both sides," said Justin Alexander, who heads strategic research and development at Hyland Software, an enterprise content management software firm, referring to both Amazon and its customers.
"Clearly, Amazon needs to take accountability for their services. But at the same time there were a variety of customers who were using the EC2 platform that did not suffer any period of unavailability," said Alexander, citing their disaster recovery plans.
Patrick Thibodeau covers SaaS and enterprise applications, outsourcing, government IT policies, data centers and IT workforce issues for Computerworld. Follow Patrick on Twitter at @DCgov or subscribe to Patrick's RSS feed. His e-mail address is pthibodeau@computerworld.com.
