Oops; Amazon Web Services EC2 cloud lost data

By Richi Jennings. April 27, 2011.

Amazon (AMZN) regrets to announce that its Amazon Web Services EC2 (Elastic Compute Cloud) service has permanently lost some customer data. After the extended outage last week, some storage volumes seem to be unrecoverable. In IT Blogwatch, bloggers learn virtualization lessons.

Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention Simon's Cat, in 'Hop It'...

Patrick Thibodeau points the apologia-finger:

Amazon's partial outage, which began Thursday and seemed largely resolved today, was an exceptional event. ... The uptime reliability ... of the largest providers of cloud-based services ... shows how well cloud providers are delivering uninterrupted services.

...

The risk of an outage is generally very low. ... The overall industry yearly average of uptime ... is 99.948% [or] 273 minutes of unavailability per year. ... The best providers are at 99.9994%, or three minutes.  
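For anyone who wants to check those uptime figures, here is a minimal sketch of the arithmetic. It assumes a 365-day year; the function name `downtime_minutes_per_year` is made up for illustration and is not from the quoted article.

```python
# Assumption: a non-leap year of 365 days.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Convert an uptime percentage into minutes of unavailability per year."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


# The two figures quoted above: industry average and best providers.
for label, pct in [("industry average", 99.948), ("best providers", 99.9994)]:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label}: {pct}% uptime is about {minutes:.0f} minutes down per year")
```

Both quoted numbers hold up: 99.948% works out to roughly 273 minutes, and 99.9994% to roughly three.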

But, as Cade Metz notes, Amazon has lost some data:

About 0.07 per cent of the EBS storage volumes in the East Region of ... its EC2 (Elastic Compute Cloud) service ... are not "fully recoverable" following the extended outage. ... Amazon divides its ... cloud into multiple geographic regions ... guaranteeing 99.95 per cent availability within each region.

...

On Monday, it announced that some volumes would not be restored. ... It is in the process of contacting these customers.  

Amazon's status monkeys fell on their corporate swords, thus:

A networking event early [April 21] triggered a large amount of re-mirroring of EBS volumes in US-EAST-1 ... creating a shortage of capacity in one of the ... Availability Zones. ... Additionally, one of our internal control planes for EBS has become inundated.

...

We have completed our ... recovery efforts ... we've recovered nearly all of the stuck volumes ... [but] 0.07% of the volumes in our US-East Region will not be fully recoverable.  

So Maureen O'Gara uses a colorful metaphor:

It’s unclear how much 0.07% represents, but there were a lot more sites on Amazon’s bollixed North Virginia Availability Zone than most people imagined.

...

Like Humpty Dumpty no matter what all the king’s horses and all the king’s men do they’re not going to get put back together again.  

And Thorsten von Eicken teaches lessons learned vicariously:

[We] heard from a good number of users who ... didn’t set up redundancy properly. Hindsight is always 20-20. ... Backup and replication have to be taken seriously. ... In EC2 this means live replication across multiple availability zones. ... A minimum of replicas must be running. ... Over-provisioning is necessary to handle the load spike after a massive failure.

...

NoSQL databases ... [are] not a silver bullet by a long shot. ... [They can] have pretty complex dynamics that can easily lead to unpleasant surprises.

...

We also were confused by Amazon’s status messages ... a clear message from Amazon that more and more volumes were continuing to fail in the zone would have been really helpful.  
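Von Eicken's point about over-provisioning across availability zones can be made concrete with a rough sizing sketch. This is not his method: it is a simplified model that assumes identical instances and that load spreads evenly over whichever zones survive. The names `instances_per_zone` and `overprovision_factor` are hypothetical.

```python
import math


def instances_per_zone(peak_instances: int, zones: int, zones_lost: int = 1) -> int:
    """Instances to run in every zone so the survivors can still carry peak load.

    Assumption: load is redistributed evenly across the surviving zones.
    """
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return math.ceil(peak_instances / surviving)


def overprovision_factor(zones: int, zones_lost: int = 1) -> float:
    """Ratio of total capacity run to the capacity needed at peak."""
    surviving = zones - zones_lost
    if surviving < 1:
        raise ValueError("at least one zone must survive")
    return zones / surviving


# Example: peak load needs 12 instances, spread over 3 zones, tolerating the loss of 1.
per_zone = instances_per_zone(12, zones=3)
print(f"run {per_zone} instances per zone, {per_zone * 3} in total")
print(f"over-provisioning factor: {overprovision_factor(3):.2f}x")
```

In this model, three zones that must survive the loss of one need half again as much capacity as the peak load alone would suggest. A real deployment would also have to budget for the extra load spike von Eicken mentions, which this sketch ignores.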

But Glenn Weinstein is all, like, don't-worry-be-happy:

The usual naysayers jumped ... on a perceived sign of weakness from the leading infrastructure-as-a-service ... provider. ... Shamefully, some alleged advocates of cloud computing for the enterprise joined the fray. ... This kind of "I told you so" finger-wagging is ill-timed, particularly with regard to the ... implications for the enterprise.

...

Amazon's infrastructure ... offers better levels of backup, failover, and load balancing than most departmental IT teams are prepared to develop. ... By highlighting ... a single large outage, we obscure the fact that countless small outages were avoided.

...

To read some AWS critics ... you'd think the "cloud first" CIOs had been proven wrong. They haven't. ... Start considering capping your current level of investment in on-premise infrastructure and systems.  


 
And Finally...
Simon's Cat: 'Hop It'


 

Richi Jennings, your humble blogwatcher

Richi Jennings is an independent analyst/consultant, specializing in blogging, email, and security. He's also the creator and main author of Computerworld's IT Blogwatch -- for which he has won American Society of Business Publication Editors and Jesse H. Neal awards on behalf of Computerworld -- plus The Long View. He has been a cross-functional IT geek since 1985. You can follow him as @richi on Twitter, pretend to be richij's friend on Facebook, or just use good old email: itbw@richij.com. You can also read Richi's full profile and disclosure of his industry affiliations.

Copyright © 2011 IDG Communications, Inc.
