Etsy, which describes itself as an online “marketplace where people around the world connect to buy and sell unique goods,” is often trotted out as a poster child for DevOps. The company latched onto the concepts early and today is reaping the benefits as it scales to keep pace with rapid business growth. Network World Editor in Chief John Dix caught up with Etsy VP of Technical Operations Michael Rembetsy to ask how the company put the ideas to work and what lessons it learned along the way.
Let’s start with a brief update on where the company stands today.
The company was founded and launched in 2005 and, by the time I joined in 2008 (the same year as Chad Dickerson, who is now CEO), there were about 35 employees. Now we have well over 600 employees and some 42 million members in over 200 countries around the world, including over 1 million active sellers. We don’t have sales numbers for this year yet, but in 2013 we had about $1.3 billion in gross merchandise sales.
How, where and when did the company become interested in DevOps?
When I joined things were growing in a very organic way, and that resulted in a lot of silos and barriers within the company and distrust between different teams. The engineering department, for example, put a lot of effort into building a middle layer – what I called the layer of distrust – to allow developers to talk to our databases in a faster, more scalable way. But it turned out to be just the opposite. It created a lot more barriers between database engineers and developers.
Everybody really bonded well together on a personal level. People were staying late, working long hours, socializing after hours, all the things people do in a startup to try to be successful. We had a really awesome office vibe, a very edgy feel, and we had a lot of fun, even though we had some underlying engineering issues that made it hard to get things out the door. Deploys were often very painful. We had a traditional mindset of, developers write the code and ops deploys it. And that doesn’t really scale.
How often were you deploying in those early days?
Twice a week, and each deploy took well over four hours.
Twice a week was pretty frequent even back then, no?
Compared to the rest of the industry, sure. We always knew we wanted to move faster than everyone else. But in 2008 we compared ourselves to a company like Flickr, which was doing 10 deploys a day, which was unheard of. So we were certainly going a little bit faster than many companies, but the problem was we weren’t going fast with confidence. We were going fast with lots of pain and it was making the overall experience for everyone not enjoyable. You don’t want to continuously deploy pain to everyone. We knew there had to be a better way of doing it.
Where did the idea to change come from? Was it a universal realization that something had to give?
The idea that things were not working correctly came from Chad. He had seen quite a lot in his time at Yahoo, and knew we could do it better and we could do it faster. But first we needed to stabilize the foundation. We needed to have a solid network, needed to make sure that the site would be up, to build confidence with our members as well as ourselves, to make sure we were stable enough to grow. That took us a year and a half.
But we eventually started to figure out little things like, we shouldn’t have to do a full site deploy every single time we wanted to change the banner on the homepage. We don’t have any more banners on the homepage, but back in 2009 we did. The banner would rotate once a week and we would have to deploy the entire site in order to change it, and that took four hours. It was painful for everyone involved. We realized if we had a tool that would allow someone in member ops or engineering to go in and change that at the flick of a button we could make the process better for everyone.
So that gave birth to a dev tools team that started building some tooling that would let people other than operational folks deploy code to change a banner. That was probably one of the first DevOps-like realizations. We were like, “Hey, we can build a better tool to do some of what we’re doing in a full deploy.” That really sparked a lot of thinking within the teams.
Then we realized we had to get rid of this app in the middle because it was slowing us down, and so we started working on that. But we also knew we could find a better way to deploy than making a TAR file and SSH’ing and R-synch’ing it out to a bunch of servers, and then running another command that pulls the server out of the load balancer, unpacks the code and then puts the server back in the load balancer. This used to happen while we sat there hoping everything is ok while we’re deploying across something like 15 servers. We knew we could do it faster and we knew we could do it better.
The idea of letting developers deploy code onto the site really came about toward the end of 2009, beginning of 2010. And as we started adding more engineers, we started to understand that if developers felt the responsibility for deploying code to the site they would also, by nature, take responsibility for if the site was up or down, take into consideration performance, and gain an understanding of the stress and fear of a deploy.
It’s a little intimidating when you’re pushing that big red button that says – Put code onto website – because you could impact hundreds of thousands of people’s livelihoods. That’s a big responsibility. But whether the site breaks is not really the issue. The site is going to break now and then. We’re going to fix it. It’s about making sure the developers and others deploying code feel empowered and confident in what they’re doing and understand what they’re doing while they’re doing it.
So there wasn’t a Devops epiphany where you suddenly realized the answer to your problems. It emerged organically?
It was certainly organic. If development came up with better ideas of how to deploy faster, operations would be like, “OK, but let’s also add more visibility over here, more graphs.” And there was no animosity between each other. It was just making things faster and better and stronger in a lot of ways.
And as we did that the culture in the whole organization begin to feel better. There was no distrust between people. You’re really talking about building trust and building friendships in a lot of ways, relationships between different groups, where it’s like, “Oh, yeah. I know this group. They can totally do this. That’s fine. I’ll back them up, no problem.” In a lot of organizations I’ve worked for in the past it was like, “These people? Absolutely not. They can’t do that. That’s absurd.”
And you have to remember this is in the early days where the site breaks often. So it was one of those things, like, OK, if it breaks, we fix it, but we want reliability and sustainability and uptime. So in a lot of ways it was a big leap of faith to try to create trust between each other and faith that other groups are not going to impact the rest of the people.
A lot of that came from the leadership of the organization as well as the teams themselves believing we could do this. Again, we weren’t an IBM. We were a small shop. We all sat very close to one another. We all knew when people were coming and leaving so it made it relatively easy to have that kind of faith in one another. I can’t recall a time where someone walked in and said, “Oh my God, that person deployed this and broke the site.” That never happened. People checked their egos at the door.
I was going to ask you about the physical proximity of folks. So the various teams were already sitting cheek by jowl?
In the early days we had people on the left coast and on the right coast, people in Minnesota and New York. But in 2009 we started to realize we needed to bring things back in-house to stabilize things, to make things a little more cohesive while we were creating those bonds of trust and faith. So if we had a new hire we would hire them in-house. It was more of a short term strategy. Today we are more of a remote culture than 2009.
But you didn’t actually integrate the development and operations teams?
In the early days it was very separate but there was no idea of separation. Depending upon what we were working on, we would inject ourselves into those teams, which led later to this idea of what we call designated operations. So when John Allspaw, SVP of operations and infrastructure, came on in 2010, we were talking about better ways to collaborate and communicate with other teams and John says, “We should do this thing called designated operations.”
The idea of designated ops is it’s not dedicated. For example, if we have a search team, we don’t have a dedicated operations person who only works on search. We have a designated person who will show up for their meetings, will be involved in the development of a new feature that’s launching. They will be injecting themselves into everything the engineering team will do as early as possible in order to bring the mindset of, “Hey, what happens if that fails to this third-party provider? Oh, yeah. Well, that’s going to throw an exception. Oh, OK. Are we capturing it? Are we displaying a friendly error for an end user to see? Etc.”
And what we started doing with this idea of designated ops is educate a lot of developers on how operations works, how you build Ganglia graphs or Nagios alerts, and by doing that we actually started creating more allies for how we do things. A good example: the search team now handles all the on-call for the search infrastructure, and if they are unavailable it escalates to ops and then we take care of it.
So we started seeing some real benefits by using the idea of this designated ops person to do cross-team collaboration and communication on a more frequent basis, and that in turn gave us the ability to have more open conversations with people. So that way you remove a lot of the mentality of, “Oh, I’m going to need some servers. Let me throw this over the wall to ops.”
Instead, what you have is the designated ops person coming back to the rest of the ops team saying, “We’re working on this really cool project. It’s going to launch in about three months. With the capacity planning we’ve done it is going to require X, Y and Z, so I’m going to order some more servers and we’ll have to get those installed and get everything up and running. I want to make everybody aware I’m also going to probably need some network help, etc.”
So what we started finding was the development teams actually had an advocate through the designated ops person coming back to the rest of the ops team saying, “I’ve got this.” And when you have all of your ops folks integrating themselves into these other teams, you start finding some really cool stuff, like people actually aren’t mad at developers. They understand what they’re trying to do and they’re extremely supportive. It was extremely useful for collaboration and communication.
So Devops for you is more just a method of work.
Correct. There is no Devops group at Etsy.
How many people involved at this point?
Product engineering is north of 200 people. That includes tech ops, development, product folks, and so on.
How do you measure success? Is it the frequency of deployments or some other metric?
Success is a really broad term. I consider failure success, as well. If we’re testing a new type of server and it bombs, I consider that a success because we learned something. We really changed over to more of a learning culture. There are many, many success metrics and some of those successes are actually failures. So we don’t have five key graphs we watch at all times. We have millions of graphs we watch.
Do you pay attention to how often you deploy?
We do. I could tell you we’re deploying over 60 times a day now, but we don’t say, “Next year we want to deploy 100 times a second.” We want to be able to scale the number of deploys we’re doing with how quickly the rest of the teams are moving. So if a designated ops or development team starts feeling some pain, we’ll look at how we can improve the process. We want to make sure we’re getting the features out we want to get out and if that means we have to deploy faster, then we’re going to solve that problem. So it’s not around the number of deploys.
I presume you had to standardize on your tool sets as you scaled.
We basically chose a LAMP stack: Linux, Apache, MySQL and PHP. A lot of people were like, “Oh, I want to use CoffeeScript or I want to use Tokyo Cabinet or I want to use this or that,” and it’s not about restricting access to languages, it’s about creating a common denominator so everyone can share experiences and collaborate.