Continuous delivery, company-wide hackathons, growth hacking, features driven by user feedback through sites like UserVoice: the new face of Microsoft is all about moving faster and being more responsive. Moving faster can also mean things breaking. But failure, and what you do when things go wrong, look rather different in a cloud-first, mobile-first world.
Fast response is part of it, and that means you have to be developing fast too; otherwise your systems won't be designed to get fixes out quickly. That's why Azure ships new features every two or three weeks, according to Technical Fellow Mark Russinovich.
"The only way to make it so you can get more stable is to release more often. Once you get your systems -- from your engineering systems, to your deployment systems, to your monitoring systems tuned, so you're getting things out quickly and detecting where health goes awry really quickly, then you don't have to let things bake for ever."
Outage outrage and the blame game
Come the next big cloud outage, be it Office 365 or Amazon Web Services, you can guarantee that a few things will happen. At first it won't be clear if anything is wrong, because failures in one region won't affect customers who have designed for resilience. Then it won't be clear what exactly is wrong, because it takes an unusual combination of problems to expose the flaws in systems designed to be resilient to everyday issues.
Then after the problem is dealt with, there will be a write-up with some technical details -- in the case of Azure, Russinovich will follow that up with an intricately detailed presentation at conferences like BUILD and TechEd -- and the promise to tackle any underlying problems that were discovered.
Those explanations will be greeted by comments that any service that fails isn't good enough, the failure shouldn't have happened, and customers should be demanding someone's head on a platter.
But -- frustrating as outages are -- perfection isn't the way the cloud world works, any more than you can find hard drives that never crash and servers that never hang.
As David Bills, Microsoft's chief reliability strategist, put it recently: "While you may not want to scream it from the rooftops, with cloud services, failure is, in fact, inevitable. And because of that, cloud services need to be designed ... to contain failures and recover from them quickly.... It is no longer about preventing failure. It is about designing resilient services in which inevitable failures have a minimal effect on service availability and functionality."
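To make that design principle concrete, here is a minimal sketch (not anything Microsoft has published) of one common "contain and recover" pattern: retrying a flaky dependency with exponential backoff so a transient failure degrades the caller instead of taking it down. The fetch_profile call in the usage comment is a purely hypothetical stand-in for any remote call.

```python
import random
import time

def call_with_retry(operation, attempts=4, base_delay=0.5):
    """Retry a flaky dependency with exponential backoff and jitter,
    so a brief outage degrades the caller rather than failing it outright."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: let a fallback path handle the failure
            # back off exponentially, with jitter so retries don't synchronize
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Hypothetical usage -- fetch_profile stands in for any remote dependency:
# profile = call_with_retry(lambda: fetch_profile(user_id))
```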
Test, fail, repeat
Not only are failures inevitable, they're also something any good cloud service will be causing on purpose. Microsoft calls it "live site testing," and like much of Microsoft's current development thinking, it was pioneered at Bing. It starts with the kind of A/B testing that Ronny Kohavi built the Microsoft Experimentation Platform to manage (he's now general manager of the analysis and experimentation team that's part of Microsoft's attempt to harness big data for product design).
These tests, which are now used to run around 250 experiments a day for Bing, have dramatically improved the Bing business. For example, experimenting with font colors improved monetization by $10 million a year, and Kohavi claims "a significant portion" of the 47% Bing revenue growth announced last October was thanks to two small changes tested through these kinds of experiments -- each of which brought in an extra $100 million a year.
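Bing's experimentation platform itself is proprietary, but the basic mechanics of this kind of A/B test are simple enough to sketch: hash each user into a stable bucket so they consistently see one variant, then compare the metrics for each bucket. The function and experiment names below are illustrative, not Microsoft's.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, variants=("A", "B")) -> str:
    """Deterministically bucket a user into an experiment variant by hashing,
    so the same user always sees the same treatment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Hypothetical usage: route a search results page to one of two font colours.
# variant = assign_variant("user-42", "serp-font-colour")  # -> "A" or "B"
```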
Counterintuitively, cloud services also have to try making things worse by deliberately introducing failure. Bills calls it "fault injection as part of a continual release process meant to help us build more reliable cloud systems."
It's an idea Amazon adopted over a decade ago, when ex-firefighter Jesse Robbins (later a founder of Chef) used his training in incident response to run the GameDay program: take down key systems to find out how resilient your service is. Robbins was known at Amazon as the Master of Disaster. Google has its own equivalent program, which includes simulating natural disasters to take out entire offices. Netflix calls its version the Chaos Monkey.
Deliberately taking down a system that's working sounds like a crazy idea, but cloud services depend on so many interdependent things -- from network routing and physical servers to software services, the people who run it all, the mobile devices we use to access them, and the power supply -- that the only way to know whether you've designed your system to cope with failure is to make it fail and see what happens.
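None of these companies publish their fault-injection tooling in this form, but the core idea fits in a few lines: pick a small random slice of the running fleet, kill it, and watch how the service and the people running it respond. The terminate callback and instance names below are hypothetical stand-ins for a real cloud API.

```python
import random

def inject_faults(instances, kill_probability=0.05, terminate=None):
    """Randomly terminate a small fraction of running instances,
    to verify the service degrades gracefully rather than falling over."""
    casualties = [i for i in instances if random.random() < kill_probability]
    for instance in casualties:
        if terminate is not None:
            terminate(instance)  # e.g. a cloud API call that stops the VM
    return casualties  # log these so the on-call team can correlate alerts

# Hypothetical usage against an in-memory fleet:
# downed = inject_faults(["web-1", "web-2", "worker-1"], kill_probability=0.3)
```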
John Allspaw at Etsy, who famously promised that he'd never fire someone for taking down a service or site he's responsible for, points out that it's cheaper to schedule failure.
"Triggering failures in a controlled manner represented an opportunity for us to learn some really important big lessons at a much lower cost and with far less disruption than would be the case if we just waited for problems to surface on their own."
You get to see how a failure plays out at the same internet scale your cloud service runs at, not just at the far smaller scale of a test system. You find out how the people running the service respond to stress and failure, and you learn how to learn from situations where things go wrong.
If you're running a cloud service properly, these kinds of tests shouldn't be that different from how people run the service from day to day -- they should be used to testing as part of development, live testing, and handling failures. And they should be used to doing that without finger-pointing, and without the attitude that a project isn't finished until the blame has been assigned.
Get over the blame game
If you cause failures on purpose, there's no one to blame and you can focus on understanding what's going wrong. But accepting that failure is inevitable changes the way you do post-mortems on accidental failures as well.
It's still common to see firings after major IT problems, but with today's complex systems -- especially cloud systems -- failures aren't likely to be one person's fault. The problems that take down cloud services are combinations of flaws and interactions between different issues: external failures, people not following procedures, cascading failures that increase the load on other systems, and mistakes that might have been made in any of the different development processes for different parts of the service. There's rarely a single, simple root cause or a single person making a mistake. Unless the person you fire is responsible for a culture that makes problems and failures more likely, hanging them out to dry doesn't do anything to fix the problems.
It's the same kind of thinking you see in the increasingly popular devops approach; you don't look at who's to blame for an issue, you look at how you can mitigate it and then you look at fixing the underlying problem. The less of a blame culture you have, the more people will be willing to try things out and work together to get things running smoothly. And that's the best way to fix problems and get the service back online faster.