Google fixes lengthy, widespread Gmail malfunction
A 10-hour disruption to email delivery and attachment downloads hit close to 50 percent of Gmail users
IDG News Service - Google has fixed a Gmail glitch that lasted about 10 hours and hit close to 50 percent of the webmail service's users, ending one of the longest, most widespread Gmail disruptions in years.
Affected users endured email delivery delays and difficulties downloading attachments due to a bug first acknowledged by Google at around 10:30 a.m. U.S. Eastern Time Monday. The company declared it solved at almost 7 p.m.
On its Google Apps Status site, the company pegged the start of the problem at close to 9 a.m. and its resolution at 6:30 p.m.
The issue affected individuals who use the free version of Gmail as well as businesses, schools and government agencies that pay for it as part of the Google Apps cloud collaboration and email suite.
In the U.S., the disruption covered most of the workday on both coasts, which heightened the impact of the bug for millions.
People who depend on Gmail for critical tasks took to Twitter, discussion groups and other online forums to express their frustration.
The last time Google gave an official figure for active Gmail users was more than a year ago, when it said there were more than 425 million.
Assuming conservatively that the service now has about 450 million active users, Monday's disruption likely affected more than 200 million users, plus senders on other email platforms whose messages weren't received in a timely fashion.
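The arithmetic behind that estimate can be sketched roughly as follows. The 450 million base is the assumption stated above; the 45 percent lower bound is a hypothetical reading of "close to 50 percent," not a figure from Google.

```python
# Rough estimate of users affected by Monday's disruption.
# ASSUMED_ACTIVE_USERS is the article's conservative assumption; the 0.45
# lower bound is a hypothetical reading of "close to 50 percent".

def estimate_affected(active_users: int, affected_share: float) -> int:
    """Return the estimated number of affected users, rounded to a whole user."""
    return round(active_users * affected_share)

ASSUMED_ACTIVE_USERS = 450_000_000

affected_low = estimate_affected(ASSUMED_ACTIVE_USERS, 0.45)
affected_high = estimate_affected(ASSUMED_ACTIVE_USERS, 0.50)

print(f"Estimated affected users: {affected_low:,} to {affected_high:,}")
```

Even at the lower assumed share, the estimate stays above 200 million.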
Google said that the severity and length of the impact varied among users. About 29 percent of messages received were delayed by an average of 2.6 seconds, but some mail was "severely delayed."
"We apologize for the duration of today's event; we're aware that prompt email delivery is an important part of the Gmail experience, and today's experience fell far short of our standards," the company wrote on the status site.
The incident is a big deal for both Google and those affected, but it shouldn't on its own dissuade CIOs from using the suite, said Forrester Research analyst TJ Keitt.
"Data centers hosting multi-tenant collaboration services aren't immune to disruptions. So, when they happen, the way to judge the vendor is on how well they identify and resolve the problem, and then inform the public to how they resolved the issue," Keitt said.
By that criterion, Google's updates throughout the incident could have been more transparent and detailed regarding the nature of the problem and the strength of the fix that was put in place, he said via email.
"They have clearly not communicated this publicly, so I hope they've been forthcoming with this information with their clients," Keitt said on Monday night.
Meanwhile, Matthew Cain, a Gartner analyst, said the incident raises fundamental questions about what is considered downtime, especially as it relates to service-level agreements from cloud application vendors.
"If message delivery is delayed 15 minutes, is that considered downtime? What about 2 hours?" he said via email. "The move to cloud email puts a spotlight on these essential questions about how to meter and compensate for subpar messaging performance that is not traditionally classified as 'downtime.'"
On Tuesday, Google offered more details about the cause of the problem and the steps it's taking to prevent it from happening again.
The cause was a "very rare" dual network failure, which brought down two separate, redundant network paths, according to a blog post from Sabrina Farmer, senior site reliability engineering manager for Gmail.
"The two network failures were unrelated, but in combination they reduced Gmail's capacity to deliver messages to users," she wrote.
Over the next few weeks, Google staffers will work on bulking up network and backup capacity for Gmail, as well as on making Gmail's message delivery more resilient in the event of a network crash, according to Farmer.
"Finally, we're updating our internal practices so that we can more quickly and effectively respond to network issues," she wrote.
Juan Carlos Perez covers enterprise communication/collaboration suites, operating systems, browsers and general technology breaking news for The IDG News Service. Follow Juan on Twitter at @JuanCPerezIDG.