Conde Nast embraces chaos engineering with PagerDuty to protect against outages


Until recently, the 100-year-old publisher of Vogue and Vanity Fair, Conde Nast, had offices in each country acting as independent business units with differing processes and workflows. But a transformation drive under the new umbrella of Conde Nast International, which looks after the offices outside of North America, saw the business turn to incident response and visibility organisation PagerDuty to help protect against costly outages.

A few years ago the relatively new Conde Nast International decided to build a unified technology platform that could be rolled out across the globe, modernising all of the wholly-owned countries (and then the licensee countries) along the way.

"The whole point was to build this new platform, it's mainly in Node.js and React JavaScript in AWS," says director of engineering and cloud infrastructure at Conde Nast, Crystal Hischorn. "The reason we started using PagerDuty was because with this platform, we're also launching 60 websites across these markets and brands - and beyond that will be integrating more SaaS-type products and building more products on top of this platform."

Hischorn, a software engineer of 15 years who previously worked with PagerDuty at the BBC, says that when Conde Nast International was searching for a visibility and workflows solution there weren't many options to pick from. Although the team also looked at the now Splunk-owned VictorOps, the clear choice was PagerDuty, not only for its technology but also for the familiarity that architects within digital publishing tend to have with the platform.

PagerDuty has been indispensable to the organisation's chaos engineering efforts, according to Hischorn.

"For us it was around the chaos engineering aspect - what we essentially do is simulate outages in different parts of our technology stack, so that could be in our infrastructure layers, it could be our networking layers, it could be in our application layers, it could be in the CDN layers. We just simulate outages, and PagerDuty is quite good about triggering the alarms - just testing that the escalation path is working, that the polices are correct there."

She also highlighted the company's decision to open source its incident management training. Because the material is open source, any company can consult it, not just PagerDuty customers, and Hischorn says it is an important resource to have freely available.

"I think it's really good to have because a lot of companies struggle with actually formulating an incident response workflow and particularly in things like where there's big outages, that's key," says Hischorn. "So I think that's great that PagerDuty open sourced that, because a lot of companies are really interested in how it's defined at a company like PagerDuty, and particularly because a lot of people use the tool in their incident management process.

"We've definitely set this up in a way that will be extremely smooth when the time comes when we have a real outage that could be a P0 or a P1, so it's been really big for us."


The visibility provided by PagerDuty also helps unify the teams working across different departments, bringing their workflows together in a "way that makes sense to everyone" and helping to reinforce a devops model at the company.

"For me [devops] is really about people having expertise in development, some will have expertise in operations, you need to try and make sure they overlap somewhat," says Hischorn. "And there's an understanding on both sides that it's not trying to make people lose their specialism necessarily, but our developers are able to operate what they build and similarly the operations people know how development works. I think this is a critical piece of software in helping achieve that."

Hischorn adds that a slew of digital tools, such as PagerDuty working in tandem with analytics vendor Datadog, has helped to reduce developer burnout.

"You can suppress alarms when you need to, you can be setting up the workflows in the right way so you can say, this system is critical, this one's not, this one's in production, this one's not," says Hischorn.

Copyright © 2018 IDG Communications, Inc.
