Office 365 email conks out twice within a week
Some customers in North and South America were affected
Network World - Microsoft's Office 365 service has suffered two email outages within a week of each other. The outages affected some customers in North and South America and stemmed from different causes, but ended in the same result: failed email delivery.
The first outage, on Nov. 8, stemmed from an overwhelmed antivirus engine and the resulting message backlog, which degraded the service. The second, on Nov. 13, resulted from a combination of failed network elements, which were not specified, routine maintenance and increased load, according to a post on the Office 365 blog by Rajesh Jha, corporate vice president of Microsoft's Office division.
He didn't say how many customers were affected or where they were located other than somewhere on the two continents. Both outages affected just Office 365 Exchange Online mail services.
Affected customers are entitled to a service credit. Jha apologizes and promises a post-mortem on the outages, as well as an update on how the Office 365 service level agreement was affected.
The Nov. 8 incident started when an antivirus engine bogged down as it processed emails that it had determined carried a particular virus. That delay in processing led to retries that further bottlenecked email flow, including legitimate emails, he says.
The issue was resolved by intercepting the tainted messages and quarantining them directly.
To head off similar problems down the line, the company has set a lower threshold for diverting problem emails and is implementing faster remediation tools. It is also adding unspecified safeguards that automate remediation of this type of problem, Jha says.
The second incident, on Nov. 13, started with scheduled maintenance that required shifting some of the load out of the data centers being serviced. During this work, unspecified network elements failed but sent no alerts of their failure, he says. On top of that, the entire infrastructure was handling more traffic from new customers. Together, these factors left some customers unable to access email services.
Traffic for affected users was shifted to healthy data centers while the issues were dealt with.
Jha says the company is in the midst of increasing capacity and is automating how equipment failures are handled to speed up recovery time.
In addition, the company is reviewing its processes to head off future outages.
"As I've said before," Jha blogs, "all of us in the Office 365 team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business - that's not acceptable. I want to assure you that we are investing the time and resources required to ensure we are living up to your - and our own - expectations for a quality service experience every day."
(Tim Greene covers Microsoft for Network World and writes the Mostly Microsoft blog. Reach him at email@example.com and follow him on Twitter https://twitter.com/#!/Tim_Greene.)