Amazon S3 is MIA; SLA full of FAIL

It's IT Blogwatch: in which Amazon's "three-nines-reliable" S3 service was down for EIGHT HOURS, Sunday. Not to mention more Error'd...

Here's Howard Dahdah frahm dahnahdah: [You're a dag, mate -Ed.]

A systems failure at Amazon's S3 hosted storage service this morning has affected a host of Web 2.0 applications such as Twitter and SmugMug, which are dependent on it for the delivery of their applications. At 9:05AM PDT on Sunday (AEST 2.05 AM Monday) Amazon issued an outage report, claiming it was experiencing "elevated error rates with S3" ... At 5:12 PM PDT (10.12 AM AEST) the S3 site was restored.

...

A variety of businesses, such as Twitter, digital photo sharing Web site SmugMug and The Huffington Post, all had issues. Twitterers were claiming their avatar images could not be displayed. The Huffington Post was also unable to display images to its stories, while SmugMug could not offer any service at all ... Another of the sites to be affected was Jungle Disk, which provides data and file storage. Its tag line is: "Reliable online storage powered by Amazon S3."

Mike Gunderloy adds:

As was the case back in February the last time this happened, the outage is apparent by chunks of Web 2.0 dropping off ... Amazon learned from the last outage that transparency is a must. If you visit the Service Health Dashboard, you can see that they know about the outage and are “pursuing corrective action”

...

Amazon does offer an SLA for the S3 service, guaranteeing 99.9% uptime or part of your money back. With .1% of a month being around 45 minutes, that means they owe people money. The requirements for claiming a refund, though, are onerous enough that no one except large users will bother.
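
To put numbers on that, here is a rough sketch of the arithmetic in Python. It assumes the 99.9% figure is measured over a calendar month and uses the outage times quoted above (9:05 AM to 5:12 PM PDT); the function names are made up for illustration and are not part of any Amazon API. A 30-day month allows about 43 minutes of downtime and a 31-day month just under 45.

```python
from datetime import timedelta

MINUTES_PER_DAY = 24 * 60


def allowed_downtime_minutes(sla: float, days_in_month: int) -> float:
    """Minutes of downtime a month can absorb before the SLA is missed."""
    return (1.0 - sla) * days_in_month * MINUTES_PER_DAY


def monthly_uptime(outage_minutes: float, days_in_month: int) -> float:
    """Fraction of the month the service was up, given total outage minutes."""
    return 1.0 - outage_minutes / (days_in_month * MINUTES_PER_DAY)


# Outage window as reported: 9:05 AM to 5:12 PM PDT on the same day.
outage = timedelta(hours=17, minutes=12) - timedelta(hours=9, minutes=5)
outage_minutes = outage.total_seconds() / 60

if __name__ == "__main__":
    for days in (30, 31):
        budget = allowed_downtime_minutes(0.999, days)
        uptime = monthly_uptime(outage_minutes, days)
        print(f"{days}-day month: budget {budget:.1f} min, "
              f"outage {outage_minutes:.0f} min, uptime {uptime:.3%}")
```

On these assumptions a single outage of that length uses up the monthly downtime budget roughly eleven times over.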

...

With two relatively serious outages in the space of 6 months, some will be asking the question of why depend on S3? The answer is simple: the rates are hard to beat.

Allen Stern's site was affected:

Images are broken because of it and I had to move the style sheet back so the site at least renders correctly. Sites like Twitter have massive broken images ... [S3] was also down this past February and Amazon explained the reasons for the outage and downtime a few days later. There has to be a way to failover when S3 is down.
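
One way such a failover might look, as a minimal Python sketch rather than anything Allen or Amazon describes: try each asset source in order and return the first one that answers. The fetcher functions below (`flaky_primary`, `local_mirror`) are hypothetical stand-ins for "read from S3" and "read from a copy on your own server"; a real site would plug in its own.

```python
from typing import Callable, Iterable, Optional

Fetcher = Callable[[str], bytes]


def fetch_with_failover(key: str, sources: Iterable[Fetcher]) -> bytes:
    """Return the asset from the first source that succeeds, in order."""
    last_error: Optional[Exception] = None
    for fetch in sources:
        try:
            return fetch(key)
        except Exception as exc:  # sketch: treat any failure as "try the next one"
            last_error = exc
    raise RuntimeError(f"all sources failed for {key!r}") from last_error


# Hypothetical fetchers, for illustration only.
def flaky_primary(key: str) -> bytes:
    raise ConnectionError("primary store is returning errors")


def local_mirror(key: str) -> bytes:
    return b"/* copy of " + key.encode() + b" served locally */"


if __name__ == "__main__":
    print(fetch_with_failover("style.css", [flaky_primary, local_mirror]))
```

The catch is that the mirror has to exist and be kept current before the outage, which is the part that costs money.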

And it affected Dan Frommer's phone:

Amazon just made one of my iPhone apps crash -- Twinkle, the Twitter app by Tapulous. Not on purpose, of course. But still, a reminder to companies that relying on third-party "cloud" hosting services -- even from giant, normally excellent Amazon -- has its drawbacks.

...

Tapulous relies on S3 for images like Twinkle user icons. And they must not have included a "plan B" in their code to handle an image server outage. So when S3 hiccuped, and Twinkle couldn't download images, the app crashed, taking me back to the iPhone home screen. (And, hours later, it's still crashing.)
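
A "plan B" of that kind can be very small. The sketch below is in Python for brevity (an iPhone app would be written in Objective-C) and says nothing about how Twinkle actually works: `load_icon`, `PLACEHOLDER_ICON` and the downloader functions are all hypothetical names. The idea is only that a failed or empty download yields a stand-in image instead of an unhandled error.

```python
from typing import Callable

# Hypothetical bundled fallback image; a real app would ship actual image data.
PLACEHOLDER_ICON = b"<generic user icon>"


def load_icon(user: str, download: Callable[[str], bytes]) -> bytes:
    """Return the user's icon, or a placeholder if the image server fails."""
    try:
        data = download(user)
    except Exception:  # sketch: any download failure falls back to the placeholder
        return PLACEHOLDER_ICON
    # An empty body is treated the same as a failure.
    return data if data else PLACEHOLDER_ICON


# Hypothetical downloaders, for illustration only.
def broken_download(user: str) -> bytes:
    raise TimeoutError("image server did not respond")


def working_download(user: str) -> bytes:
    return b"icon-for-" + user.encode()


if __name__ == "__main__":
    print(load_icon("dan", broken_download))
    print(load_icon("dan", working_download))
```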

Antonio Rodriguez asks:

The story goes something like this: aware that they had built up incredibly robust excess capacity for handling the peaks of e-commerce traffic on Amazon.com, the bright minds from Seattle decided to offer the same capacity to the rest of the web, kicking off the era of cloud computing for the thousands of customers that signed up for their triple threat of services: S3 (storage), EC2 (compute cycles), and SQS (messaging queues).

And yet, if AWS is using Amazon.com's excess capacity, why has S3 been down for most of the day, rendering most of the profile images and other assets of Web 2.0 tapestry completely inaccessible while at the same time I can't manage to find even a single 404 on Amazon.com? Wouldn't they be using the same infrastructure for their store that they sell to the rest of us?

And foobar bletches:

For those of you who are not familiar with Amazon S3: It's a cloud-computing service ... It allows for cheap, and usually reliable on-demand online storage ... These cloud-services are popular, because they allow you to build massively scalable architectures without having to spend a dime on hardware. You only pay for the storage or the computing capacity you use in their data centres.

...

I am impressed by how powerful the concept is and how well it normally works. I also believe that Amazon learns from glitches like this and manages to improve their system as a result of it. But nevertheless, it is still relatively young, and so apparently not all issues are sorted out yet.

Needless to say, a 6 or 7 hour outage means a lot of egg on Amazon's face. That's not the kind of publicity they want. Still, though, if you can think of strategies to soften the impact of outages of individual components in your architecture, I would still recommend the Amazon services if you are a startup in search of cheap and scalable computing and storage resources.

But Loïc Le Meur feels smug:

I am glad we moved all Seesmic to Servepath from Amazon S3 a few months ago, they have been down only once and very short outage, which is not acceptable given that it is not cheap hosting, but still we are very happy about it so far.

Hosting is a key part of building a service, and I heard many friends telling me to use Amazon S3 which we did at the beginning but then decided to move away.

Richard MacManus is stoic:

But it does make us ask questions such as: why can't we get 99% uptime? Or: isn't this what an SLA is for? ... I guess the answer to the question, how much is too much downtime, is: hey, whataya gonna do? (imagine that said in a New York accent and with a shrug).

And finally...

  • Error'd:
    • JIT taken to extremes
    • Huge public BSODs
    • "Odd" Samsung Konglish
    • Scientific phone number
    • ...and more

Richi Jennings is an independent analyst/adviser/consultant, specializing in blogging, email, and spam. A 21 year, cross-functional IT veteran, he is also an analyst at Ferris Research. You can follow him on Twitter, pretend to be Richi's friend on Facebook, or just use boring old email: blogwatch@richi.co.uk.

Copyright © 2008 IDG Communications, Inc.

  