Amazon cites cause of recent outage, issues refunds
Network World - An unexpected bug that cropped up after new hardware was installed in one of Amazon Web Services' Northern Virginia data centers caused the more than 12-hour outage last week that brought down popular sites such as Reddit, Imgur, Airbnb and Salesforce.com's Heroku platform, according to a post-mortem issued by Amazon.
In response, AWS says it is refunding certain charges to affected customers, specifically those who had trouble accessing AWS application programming interfaces (APIs) during the height of the downtime.
AWS says the latest outage was limited to a single availability zone in the US-East-1 region, but an overly aggressive throttling policy, which it has vowed to fix as well, spread the issue for some customers into multiple zones.
The problem arose Oct. 22 from what AWS calls a "latent memory bug" that appeared after a failed piece of hardware had been replaced in one of Amazon's data centers. The system failed to recognize the new hardware, which caused a chain reaction inside AWS's Elastic Block Store (EBS) service that eventually spread to its Relational Database Service (RDS) and its Elastic Load Balancers (ELBs). Reporting agents inside the EBS servers kept attempting to contact the failed server that had been removed.
"Rather than gracefully deal with the failed connection, the reporting agent continued trying to contact the collection server in a way that slowly consumed system memory," the post-mortem reads. It goes on to note that "our monitoring failed to alarm on this memory leak."
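The failure mode described in the post-mortem, an agent that keeps retrying a dead server while its memory use grows, can be illustrated with a minimal Python sketch. This is not AWS's code: the `ReportingAgent` class, its `send` callable and the `max_pending` and `max_backoff` parameters are hypothetical names chosen for illustration. The sketch assumes that one way to "gracefully deal with the failed connection" is to cap both the backlog of unsent reports and the retry delay.

```python
import collections


class ReportingAgent:
    """Hypothetical agent that forwards reports to a collection server."""

    def __init__(self, send, max_pending=100, max_backoff=60.0):
        # send: callable that raises ConnectionError when the server is unreachable
        self.send = send
        # Bounded queue: once full, appending drops the oldest report,
        # so memory use stays flat however long the server is down.
        self.pending = collections.deque(maxlen=max_pending)
        self.backoff = 1.0
        self.max_backoff = max_backoff

    def report(self, item):
        """Queue an item, try to flush, and return seconds to wait before retrying."""
        self.pending.append(item)
        return self.flush()

    def flush(self):
        """Send queued items in order; on failure return a capped, doubling delay."""
        while self.pending:
            try:
                self.send(self.pending[0])
            except ConnectionError:
                delay = self.backoff
                self.backoff = min(self.backoff * 2, self.max_backoff)
                return delay
            self.pending.popleft()
            self.backoff = 1.0  # a success resets the retry delay
        return 0.0
```

With a collection server that never answers, the backlog in this sketch stops growing at `max_pending` entries and the suggested retry delay levels off at `max_backoff` seconds, rather than consuming memory without limit.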
AWS says it's difficult to set accurate alarms for memory usage because the EBS system dynamically uses resources as needed, so memory usage fluctuates frequently. The system is supposed to tolerate missing servers to a degree, but eventually the memory leak became so severe that it started impacting customer requests. From there, the issue snowballed -- "the number of stuck volumes increased quickly," AWS reports.
AWS first reported a small issue at 10 a.m. PT but within an hour said the issue was impacting a "large number of volumes" in the affected availability zone. This seems to be the point when major sites such as Reddit, Imgur, Airbnb and Salesforce.com's Heroku platform all went down. By 1:40 p.m. PT, AWS said, 60% of the impacted volumes had recovered, but AWS engineers were still baffled as to why.
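To see why a fixed threshold struggles with fluctuating memory use, and what an alternative might look like, consider the following rough Python sketch. It is an assumption for illustration only, not the monitoring AWS uses or has said it will adopt: the `leak_suspected` function and its `window` and `min_growth` parameters are hypothetical. The idea is to compare the lowest reading in an early window with the lowest reading in a recent one, since a leak tends to raise the floor of usage even when the peaks swing widely.

```python
def leak_suspected(samples, window=60, min_growth=0.10):
    """Return True if the floor of memory usage rose by at least min_growth.

    samples: memory readings in time order (any consistent unit).
    window: number of readings in the early and the recent comparison windows.
    min_growth: fractional rise in the floor that triggers the alarm.
    """
    if len(samples) < 2 * window:
        return False  # not enough history to compare two windows
    early_floor = min(samples[:window])
    late_floor = min(samples[-window:])
    if early_floor <= 0:
        return False  # avoid dividing by zero on empty or bogus readings
    return (late_floor - early_floor) / early_floor >= min_growth


# Usage oscillates between 100 and 145 units with no underlying growth.
steady = [100 + (i % 10) * 5 for i in range(200)]
# The same oscillation with a slow leak of 0.5 units per reading added.
leaking = [100 + (i % 10) * 5 + i * 0.5 for i in range(200)]

print(leak_suspected(steady))   # False
print(leak_suspected(leaking))  # True
```

A check like this would still need tuning for real workloads, where the floor can rise legitimately as load grows.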
"The large surge in failover and recovery activity in the cluster made it difficult for the team to identify the root cause of the event," the report reads. Two hours later the team figured out the problem and restoration of the remaining impacted services continued until it was almost fully complete by 4:15 p.m. PT.