
Amazon cloud outage was triggered by configuration error

Company's postmortem and apology win praise for transparency

April 29, 2011 03:38 PM ET

Computerworld - Amazon has released a detailed postmortem and mea culpa about the partial outage of its cloud services platform last week and identified the culprit: A configuration error made during a network upgrade.

During this configuration change, a traffic shift "was executed incorrectly," Amazon said, noting that traffic that should have gone to a primary network was routed to a lower capacity one instead. The error occurred at 12:47 p.m. on April 21 and led to a partial outage that lingered through last weekend.

The outage sent a number of prominent Web sites offline, including Quora, Foursquare and Reddit, and renewed an industry-wide debate over the maturity of cloud services.

Amazon posted short, bulletin-like updates throughout the outage, but what it offered in its postmortem is entirely different. The nearly 5,700-word document includes a detailed look at what happened, an apology, a credit to affected customers, as well as a commitment to improve its customer communications.

Amazon didn't say explicitly whether it was human error that touched off the event, but it hinted at that possibility when it wrote that "we will audit our change process and increase the automation to prevent this mistake from happening in the future."
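As a rough illustration of the kind of automated check a change process could include, the sketch below refuses a traffic shift when the target network would be left without enough spare capacity. It is not Amazon's tooling: the `Network` class, the `validate_traffic_shift` function, the capacity figures and the 20% headroom default are all hypothetical, chosen only to show the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Network:
    """Hypothetical description of a network that can receive shifted traffic."""
    name: str
    capacity_gbps: float      # total usable capacity (assumed figure)
    current_load_gbps: float  # traffic the network already carries


class UnsafeShiftError(Exception):
    """Raised when a planned traffic shift would overload the target network."""


def validate_traffic_shift(target: Network, shifted_gbps: float,
                           headroom: float = 0.2) -> float:
    """Return the projected utilization of `target`, or raise if unsafe.

    `headroom` is the fraction of capacity that must remain free after
    the shift; 0.2 is an arbitrary illustrative default.
    """
    if target.capacity_gbps <= 0:
        raise ValueError("target capacity must be positive")
    projected = (target.current_load_gbps + shifted_gbps) / target.capacity_gbps
    if projected > 1.0 - headroom:
        raise UnsafeShiftError(
            f"shifting {shifted_gbps} Gbps to {target.name} would reach "
            f"{projected:.0%} utilization"
        )
    return projected


if __name__ == "__main__":
    primary = Network("primary", capacity_gbps=100.0, current_load_gbps=10.0)
    secondary = Network("secondary", capacity_gbps=10.0, current_load_gbps=2.0)

    # Shifting to the high-capacity network passes the check.
    print(validate_traffic_shift(primary, 40.0))

    # Shifting the same traffic to the lower-capacity network is rejected.
    try:
        validate_traffic_shift(secondary, 40.0)
    except UnsafeShiftError as err:
        print("rejected:", err)
```

A check like this only guards against one class of mistake, sending traffic to a network that cannot carry it. It says nothing about how the change itself is reviewed or rolled back.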

The initial mistake, followed by the subsequent increase in network load, exposed a cascading series of issues, including a "re-mirroring storm" in which systems continuously searched for storage space.

Amazon also said in its explanation of the outage that it will work to ensure that it builds software and services that can survive failures.

Matt Stevens, the CTO of AppNeta, a cloud network performance management company and an Amazon cloud user, praised Amazon's postmortem for its transparency. "As a technical architect, I thought it was actually amazing how deep they went into it," said Stevens, adding that he wished the company had offered more detail about the initial network change that started the problem.

In terms of the overall issue, Stevens said: "How does anybody who runs their own private data center know how it's going to hold up until you have a massive issue?"

Jim Damoulakis, CTO of GlassHouse Technologies, an enterprise storage services provider, called it "a pretty thorough postmortem and I think for the most part they are being transparent about it."

Damoulakis said that while Amazon will take steps to keep the problem from happening again -- and to make their availability zones more robust -- customers will ultimately be responsible for having a good disaster recovery plan.

"I think there is blame on both sides," said Justin Alexander, who heads strategic research and development at Hyland Software, an enterprise content management software firm, referring to both Amazon and its customers.

"Clearly, Amazon needs to take accountability for their services. But at the same time there were a variety of customers who were using the EC2 platform that did not suffer any period of unavailability," said Alexander, citing their disaster recovery plans.

Patrick Thibodeau covers SaaS and enterprise applications, outsourcing, government IT policies, data centers and IT workforce issues for Computerworld. Follow Patrick on Twitter at @DCgov or subscribe to Patrick's RSS feed. His e-mail address is pthibodeau@computerworld.com.

Read more about Cloud Computing in Computerworld's Cloud Computing Topic Center.


