The worst cloud outages of 2013 (part 2)

We've seen some embarrassingly bad cloud outages in 2013 from some of the biggest names in technology. So which fail was the worst of them all?

Cloudy with a chance of failure

Like any technology, cloud services aren't perfect. When a cloud service stumbles, though, the whole world notices.

We saw plenty of high-profile cloud outages in the first half of 2013 -- and the last half of the year has been no different. From a 5-minute failure that cost half a million dollars to a week-long disruption that caused an immeasurable amount of brand damage, some of technology's biggest players have been playing defense with both their servers and their public images.

Here are the worst mishaps from the second half of the year.

Google goes down

Date: July 10, 2013

Duration: About 40 minutes

Failure: From a productivity standpoint, what would be the worst Google services to go offline for 40 minutes on a Wednesday while you're trying to work? Maybe Gmail, Google Drive (including the entire Google Docs editing suite), and Google Calendar?

Yup -- those are exactly the services that crashed on the morning of July 10, forcing countless users to pace around nervously and/or take extended coffee/hoagie breaks.

The Google outage aftermath

Fallout: People freaked out. Well, virtually, anyway. 

The tweets tell the story. "Gmail being down has brought my productivity this morning to a complete halt," one user wrote. "If you listen carefully, you can hear the screams of anguish as people realize that yes, Gmail is down," another observed. And: "Is the NSA doing an upgrade?" a wisecracker quipped.

Fix: Google used its Apps Status Dashboard to keep users apprised of its progress. Once things were back up and running, the company apologized for the inconvenience and assured folks that "system reliability is a top priority at Google" and that the company was "making continuous improvements to make [its] systems better."

Facebook falls flat

Date: June 18, 2013

Duration: About 30 minutes

Failure: It's the rare cloud outage that actually increases productivity: On an otherwise quiet Tuesday evening in June, Facebook went dark for a large number of users around the world. Trying to get to the site netted you nothing more than a "Sorry, something went wrong" error message in the desktop browser -- or a vague "network error" message in the mobile app.

The Facebook outage aftermath

Fallout: No tantalizing breakfast recaps from your auntie, no awkward tales of child rearing from your old college roommate, and no shared e-cards from that weird lady you used to work with -- oh my! How the world manages to survive without Facebook for half an hour is beyond me. Of course, for orgs and devs whose lifeblood depends on Facebook, such as Gamers Unite, the flameout was justifiably enraging.

Fix: Facebook says "an internal issue in [its] Web infrastructure" was to blame for the site's temporary shutdown. The company said only that it "resolved the issue quickly," without going into detail.

Whew -- close call. Just imagine what might have happened if the service had stayed offline for a full hour.

Outlook.com gets knocked out

Date: August 14-17, 2013

Duration: About three days

Failure: Outlook.com offline for three days? Ouch. Microsoft's major outage started on a Wednesday morning in August, when the Web mail service partially shut down and wouldn't let some users into their inboxes. The issue also affected SkyDrive and the company's Contacts service.

The services worked on and off over the following days, but it wasn't until midway through that weekend that everything got back to normal.

The Outlook.com outage aftermath

Fallout: When a three-day outage occurs just days after a company boasts about its uptime, a good amount of crow has to be eaten. Microsoft apologized to users, saying a failure in its Exchange ActiveSync caching service caused the cascading crashes: As countless devices received errors and kept trying to connect to the company's servers, the rush of traffic proved to be too much for its machines to handle.

Fix: Microsoft says it increased network bandwidth in the area of the system that failed and also altered the way Exchange ActiveSync error handling is, well, handled. The company believes those changes should prevent such a mess from happening again.
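
The failure mode described here, in which clients that hit errors retry all at once and bury the servers, is commonly softened with capped exponential backoff plus random jitter. The sketch below illustrates that general technique only; it is not Microsoft's actual fix, and the function name, base delay, and cap are assumptions chosen for the example.

```python
import random


def backoff_delays(attempts, base=1.0, cap=60.0, rng=None):
    """Return wait times in seconds for successive retries.

    Uses capped exponential backoff with "full jitter": each wait is a
    random value between 0 and min(cap, base * 2**attempt), so clients
    that failed at the same moment do not all retry at the same moment.
    All parameter values here are illustrative assumptions.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    # Show the (randomized) schedule a single client might follow.
    for attempt, delay in enumerate(backoff_delays(8), start=1):
        print(f"retry {attempt}: wait {delay:.1f}s")
```

Spreading retries out this way trades a slightly slower recovery for any one device against a much gentler load on servers that are already struggling.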

Amazon.com isn't always on

Date: August 19, 2013

Duration: Around 40 to 45 minutes

Failure: Amazon.com crashing is kind of like Wal-Mart's doors getting stuck shut; it just doesn't happen. At least, not usually.

A rare exception occurred on a back-to-school shopping day in August, when Amazon's virtual gates became blocked for the better part of an hour. The website and its mobile apps wouldn't work, serving up either broken links or error messages.

Most third-party companies utilizing Amazon Web Services seemed to be unaffected, with the exception of Amazon's own Audible.com, which also suffered downtime during the blip.

The Amazon outage aftermath

Fallout: Using Amazon's year-long net sales for 2012, one publication estimated that the site could have lost as much as $1,104 per second during the time it was offline. That's $66,240 in lost sales per minute -- no small chunk of change.
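
The arithmetic behind that estimate is easy to reproduce. The sketch below uses only the $1,104-per-second figure quoted above and the 40-to-45-minute duration reported for the outage; the function and constant names are illustrative, and the totals are a rough ceiling implied by the estimate, not a measured loss.

```python
# Back-of-the-envelope estimate of lost sales during an outage.
# The per-second figure is the publication's estimate quoted in the article;
# everything else is plain multiplication.

LOST_SALES_PER_SECOND = 1_104  # USD, estimate cited in the article


def estimated_loss(minutes, per_second=LOST_SALES_PER_SECOND):
    """Return the estimated lost sales in USD for an outage lasting `minutes`."""
    return per_second * 60 * minutes


if __name__ == "__main__":
    print(f"Per minute: ${estimated_loss(1):,.0f}")
    for minutes in (40, 45):
        print(f"{minutes} minutes: ${estimated_loss(minutes):,.0f}")
```

At that rate, an outage of 40 to 45 minutes works out to roughly $2.6 million to $3 million in potential lost sales.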

Fix: Amazon.com came back online with little fanfare. The company offered no public explanation for what exactly went wrong.

Amazon Web Services stumbles

Date: August 25, 2013

Duration: About an hour

Failure: August just wasn't a good month for Amazon. Mere days after its Amazon.com outage, the company's Amazon Web Services cloud computing force reported "degraded performance" with an EC2 service in Northern Virginia along with "connectivity issues" for Elastic Load Balancing systems at the same center.

The AWS outage aftermath

Fallout: Numerous companies that depend on AWS for their businesses -- including big-name players like Instagram and Vine -- were offline for part of the afternoon until service was restored.

Fix: Amazon says the outage was ultimately caused by "a 'grey' partial failure with a networking device" that resulted in unexpected packet loss. Engineers removed the networking device in question and sent it to a lab for testing. It goes to show how little needs to go down to have a wide-reaching effect.

Apple iCloud weathers a storm

Date: August 22, 2013

Duration: Around 11 hours

Failure: Say what you want about Apple, but the company's cloud services haven't exactly garnered a great reputation. Things took a particularly bad turn in August, when all iCloud services went dark for lots of folks for most of a day. The outage actually affected a small number of users, Apple says -- "less than 1 percent" -- but when you look at the company's grand totals, that could still be as many as 3 million people.

Apple had similar iCloud outages in February, when iCloud failed for several hours for some users, and in June, when the service went kaput for many a customer.

The iCloud outage aftermath

Fallout: iCloud encompasses an awful lot of Apple services, so when it's offline, affected users are unable to access such features as iMessage, Photo Stream, Documents in the Cloud, Backup and Restore, and iPhoto Journals.

Fix: Being that Apple is, you know, Apple, the company didn't reveal much of anything about what actually caused its outage or what it did to fix it and prevent future recurrences. C'est la vie in the land of Cupertino.

Bad times for The Times

Date: August 14 and 27, 2013

Duration: Several hours total

Failure: What happens when the newspaper of record goes off the grid? The website for The New York Times had a rough August: The publication was offline for about 90 minutes on August 14. Then, about two weeks later, the site went offline again in what appeared to be a far more serious situation -- one related to a domain name hijacking blamed on a hacking entity known as the Syrian Electronic Army.

The NYTimes outage aftermath

Fallout: With no website to use, The Times resorted to posting news stories on its Facebook page (yes, really). In the second instance, the publication also had to ask its staff members to use caution in sending emails because of the nature of the incident.

Fix: According to The New York Times, the first failure was the result of a server outage that happened right on top of a scheduled maintenance update. Finding a fix for the second situation was a bit more involved: The Times says the attack was directed at its domain name registrar and was repeated multiple times. An executive at the publication noted that a registrar "should have extremely tight security" to prevent such instances from occurring.

Google falters again

Date: August 16, 2013

Duration: Five minutes

Failure: Just over a month after its July apps outage, Google's entire suite of Web-based services -- including, in a rare instance, the actual Google.com home page -- went MIA for about five minutes. You wouldn't think five minutes would be much, but when you're talking about Google.com and all its related properties, that amount of time can seem like an eternity.

The Google outage aftermath

Fallout: Just how bad can five minutes really be? According to one analysis, Google's downtime caused a whopping 40 percent drop in global Internet traffic.

The number of tweets per minute about Google, meanwhile, was said to increase five-fold -- jumping from a typical average of 200 per minute to well over 1,000 per minute during the outage.

As for financial impact, by doing a little math with Google's second-quarter revenue, one writer estimated the company could have lost around $545,000 during the short time it was down.
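
To put that figure in perspective, the quoted estimate can be broken down into a per-minute and per-second rate. This sketch uses only the $545,000 estimate and the five-minute duration given above; the variable names are illustrative, and the result inherits whatever assumptions the original writer made.

```python
# Break the writer's $545,000 estimate down into per-minute and
# per-second rates. Both inputs come straight from the article.

ESTIMATED_LOSS_USD = 545_000
OUTAGE_MINUTES = 5

per_minute = ESTIMATED_LOSS_USD / OUTAGE_MINUTES
per_second = per_minute / 60

if __name__ == "__main__":
    print(f"Per minute: ${per_minute:,.0f}")
    print(f"Per second: ${per_second:,.2f}")
```

That comes to about $109,000 for every minute the services were unreachable.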

Fix: Google never went into detail about what caused its recess, saying only that it had "received reports of an issue affecting some Google services" and that the issue had been resolved.

Amazon Web Services, the sequel

Date: September 13, 2013

Duration: Just under two hours

Failure: In honor of Friday the 13th, Amazon Web Services decided to give a repeat performance of its summer shutdown. This one was limited to a single zone -- specifically, Amazon's data centers in the Northern Virginia area -- and was said to revolve around the ever-popular "network connectivity issues."

The AWS outage aftermath

Fallout: Some AWS customers suffered either downtime or partial unavailability during the outage. Heroku and GitHub were among those affected.

Fix: Amazon got its issues ironed out but declined to go into detail about what precisely had gone wrong or what steps it took to correct the problem.

Gmail, Gfail

Date: September 23, 2013

Duration: Several hours

Failure: Google's Gmail service took a hit in late September when some users found their messages being delivered a full two hours late -- and sometimes without any attachments available.

The Gmail outage aftermath

Fallout: Google says the extreme delays affected only about 1.5 percent of all messages being sent through Gmail. Still, 1.5 percent of all Gmail messages is no small number, and a two-hour delivery delay can be a serious problem in our world of instant communication.

Fix: According to Google's post-mortem, the issue was the result of a rare "dual network failure" in which "two separate, redundant network paths both stop[ped] working at the same time." After restoring the network capacity and clearing out the backlog of late messages, Google promised to boost its network and backup capacity and to make Gmail message delivery more resilient even with limited resources.

Verizon takes down HealthCare.gov

Date: October 27, 2013

Duration: Several hours

Failure: Let's face it: The HealthCare.gov website hasn't exactly been a well-oiled machine. On a Sunday in October, however, it wasn't the site itself but rather Verizon that caused everything to come crashing down for several hours.

A data center powered by Verizon's Terremark service took a nosedive that day -- and took HealthCare.gov down with it. Talk about bad timing.

The Verizon outage aftermath

Fallout: With all the technical snafus the HealthCare.gov site had already seen, the Verizon Terremark outage caused surprisingly little outrage. Maybe people just assumed it was more of the same bad juju that had plagued the site before?

Fix: Verizon worked through the night and got its data center back online the next morning. The company said the outage stemmed from an "issue with [a] networking component"; it reportedly affected an entire data services hub that handled traffic for multiple federal agencies.

Microsoft has another meltdown

Date: November 21, 2013

Duration: On and off for a few hours

Failure: A bunch of Microsoft services kicked the bucket in late November, with outages being reported for Outlook.com, Office365.com, Windows Azure, Xbox Live, and even some general Microsoft Web properties.

The Microsoft outage aftermath

Fallout: The far-reaching outage generated pages of complaints on places like Hacker News and even hit the tabloids, with The Register declaring the incident a "global cloud catastrophe." A slight exaggeration, perhaps, but isn't that what the Internet is for?

Fix: Microsoft didn't go into great detail but did say its Azure outage was "a separate issue" from its other online service interruptions. Some reports suggested an internal DNS glitch may have been to blame.

Yahoo Mail stops delivering

Date: December 9-13, 2013

Duration: 5+ days

Failure: 2013 just hasn't been the year for Yahoo Mail. Following ongoing complaints about a redesign of the service, the email app stopped working altogether for many users during the second week of December.

The issues surfaced Dec. 9th, when some users found themselves unable to sign into Yahoo Mail. By the 12th, Yahoo said "most affected users" had regained access, with the exception of IMAP. The company also said that some messages dating as far back as November 25 were backlogged in the system and would be further delayed in delivery.

On Friday the 13th -- fittingly -- CEO Marissa Mayer issued a formal apology and said things were almost back to normal.

The Yahoo Mail outage aftermath

Fallout: The combination of frustration from the outage and lingering irritation from the UI-change fiasco caused a storm of discontent. Making matters worse was Yahoo's slow public response to the matter -- what AllThingsD's Kara Swisher described as a "lack of PR savvy in dealing with [the] situation."

Fix: Slowly but surely, Yahoo got its act together. The company blamed a vague "hardware problem" for the outage, saying the issue was "harder to fix" than what it had anticipated and that it took "dozens of people working around the clock" to get things back to normal.

Mayer noted that the problem was "particularly rare" and that the company would work on "improvements" to prevent similar issues.