How Segment went from monolithic to microservices and back again

When customer data startup Segment ran into problems with its cumbersome old infrastructure, like many other businesses it turned to microservices - but soon ran into a web of complexity that could only be untangled by returning once again to a 'monolithic' architecture.

Segment counts blue-chip companies such as Time, IBM and Levi's among its customers. It promises to let businesses view all of the customer data they own in one place, before feeding it downstream for sales, analytics, helpdesk and more.

"We started with a developer-first approach where we give people a single API, we give them some libraries, they can send in data," explains CTO and cofounder Calvin French-Owen. "Then we help synthesize the data, organise it, put it into the proper schema - and then help them adapt it to over 200 different tools they might be using, whether that's a sales tool like Salesforce, a customer success tool like Zendesk, or a data warehouse like Redshift or BigQuery."

The business is running entirely on AWS, with 16,000 containers managed by the Elastic Container Service (ECS), encompassing 250 different microservices.

When the company first launched, it ran everything on a monolithic architecture. The API ingested the data and forwarded it to a single queue, where a worker read each event and sent it on to every desired server-side 'destination' - a partner API - one after the other, in a linear chain.

The team quickly noticed that with this approach, if a tool was returning a server error the retry attempts would rejoin the queue and get mixed up with other events - essentially clogging the pipe and creating performance problems.
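The clogging effect can be illustrated with a small simulation. This is only a sketch, not Segment's code: the function name, the destination names and the `max_attempts` retry limit are all assumptions made for the example. It models one shared first-in, first-out queue where a failed delivery rejoins the tail.

```python
from collections import deque


def drain_shared_queue(events, failing, max_attempts=3):
    """Drain one shared FIFO queue; failed deliveries rejoin the tail.

    events: list of (destination, event_id) pairs in arrival order.
    failing: set of destinations whose API always errors (hypothetical).
    max_attempts: assumed retry limit, not a figure from the article.
    Returns a log of (destination, event_id, attempt, delivered) tuples.
    """
    queue = deque((dest, event_id, 1) for dest, event_id in events)
    log = []
    while queue:
        dest, event_id, attempt = queue.popleft()
        delivered = dest not in failing
        log.append((dest, event_id, attempt, delivered))
        if not delivered and attempt < max_attempts:
            # The retry goes back into the same queue as everything else.
            queue.append((dest, event_id, attempt + 1))
    return log


log = drain_shared_queue(
    events=[("flaky-tool", 1), ("healthy-tool", 1), ("flaky-tool", 2)],
    failing={"flaky-tool"},
)
wasted = sum(1 for _, _, _, delivered in log if not delivered)
```

In this toy run, six of the worker's seven delivery attempts are spent on the failing tool, which is the kind of contention the single queue could not avoid.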

"That was our original inspiration for breaking up that monolith into all these microservices that we had," explains Alex Noonan, the software engineer who led the project.

"Then, we had an API that ingested the data, and then we had a router - and the router would route the event to the destination-specific queue and service. So we'd have an event come in, it would see, 'OK, this event needs to go to Google Analytics and Mixpanel', it would make two copies of that event and send it to each one of those queues."
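The fan-out Noonan describes might look roughly like the sketch below. The event shape, the `destinations` field and the `route` function are hypothetical names chosen for illustration, and in-memory deques stand in for whatever queueing system Segment actually used.

```python
import copy
from collections import defaultdict, deque


def route(event, destination_queues):
    """Put an independent copy of the event on each requested destination's queue."""
    for destination in event["destinations"]:
        destination_queues[destination].append(copy.deepcopy(event))


# One queue per destination, created on first use.
queues = defaultdict(deque)
event = {"id": "evt-1", "destinations": ["google-analytics", "mixpanel"]}
route(event, queues)
```

Because each destination has its own queue, a retry for one partner API only waits behind other events bound for that same partner.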

Sleepless nights in Microservices World

Microservices worked well for a time. The problems arose deep into "microservices world", as Noonan puts it, when new queues formed for each individual partner API destination and service - and the developers soon found themselves drowning in complexity, having to address every update manually and losing track of which versions were in which repos.

"As time went on we noticed our developer productivity was suffering, because what had happened was everything was in a separate queue, a separate service, as well as its own repo," says Noonan. "We wrote these shared libraries to help us maintain and build all these integrations but we didn't have great testing around them or a good way to deploy them.

"When we wanted to make an update to the shared library, we were strapped for time and resources - so we'd only update the version in, say, Google Analytics. And now everything else was on a different version."

The team had to keep track of which version of the library each of the partner APIs was running on, and the differences between those versions.

"Maintenance started to become a pretty big headache for developers. If we did want to suck it up and make the change to every single one of those services, we had to spend multiple days, multiple developers testing and deploying every single one of those services - it became a huge burden, so we ended up opting for not making changes that we'd desperately need.

"Or we would have to suck it up and know it would take probably about a week and a whole team effort to get a tiny change out to every single service."

If the lost sleep from addressing those queues wasn't bad enough, Segment was also noticing performance issues cropping up again - for example, the larger destinations like Google Analytics were processing thousands of events per second, but some others would only process a handful a day.

The team introduced auto-scaling rules to reduce the manual customisation for those services, but each of them had a distinct load pattern for CPU and memory, so the rules just didn't work for all of the integrations.

"We were constantly getting paged and had to manually step in and scale up some of these services," says Noonan. "After about two years of that microservices setup we had 140 of these different services in queues and repos, and we couldn't make any headway, we were struggling to keep the system alive.

"We had to take a bit of a step back and say: 'what are we going to do to fix this?' We would have had to add more bodies to the problem, which wasn't really something we wanted to do."

Noonan tells us there wasn't a single tipping point, as far as she is aware, at which the microservices architecture became untenable, but the performance slowdowns suffered as they added destinations were a "pretty big red flag".

"I joined the team at the peak of the microservices craziness and discussions were starting to happen around how we can solve it," she says.

Back to the monolith

As Noonan outlines in an extensive blog post, the team had to figure out how to rearchitect the microservices into one big functional system - a new monolith. They decided to roll it all into the Centrifuge infrastructure project the company was building - a single event delivery system that's become central to Segment's business.

"Centrifuge was this general purpose system to create all these queues and absorb traffic when there's failures, and as part of that shift, that's when we decided to move a bunch of this code all into one place - to consolidate a little bit more and kill two birds with one stone," explains CTO French-Owen.

"Given that there would be only one service, it made sense to move all the destination code into one repo, which meant merging all the different dependencies and tests into a single repo," Noonan writes in the blog. "We knew this was going to be messy.

"For each of the 120 unique dependencies, we committed to having one version for all our destinations. As we moved destinations over, we'd check the dependencies it was using and update them to the latest versions. We fixed anything in the destinations that broke with the newer versions."
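The "one version for all destinations" rule lends itself to an automated check. The following is a minimal sketch under stated assumptions, not the team's actual tooling: the manifest format, the package names and the `find_version_drift` helper are all invented for the example.

```python
from collections import defaultdict


def find_version_drift(manifests):
    """Return dependencies that are pinned at more than one version.

    manifests: {destination: {dependency: version}} (hypothetical format).
    Result: {dependency: {version: [destinations using it]}}.
    """
    seen = defaultdict(lambda: defaultdict(list))
    for destination, dependencies in manifests.items():
        for name, version in dependencies.items():
            seen[name][version].append(destination)
    return {
        name: {version: sorted(users) for version, users in versions.items()}
        for name, versions in seen.items()
        if len(versions) > 1
    }


manifests = {
    "google-analytics": {"shared-lib": "2.0.0", "http-client": "1.4.0"},
    "mixpanel": {"shared-lib": "1.3.0", "http-client": "1.4.0"},
}
drift = find_version_drift(manifests)
```

Here only `shared-lib` is reported, since both destinations agree on `http-client`. Once every destination lives in one repo with one dependency list, a check like this has nothing left to report.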

Creating that consistency "significantly reduced" complexity across the codebase. The team also created a test suite that let it run all the destination tests quickly and easily in one go - the difficulty of testing had been one of the main things that previously put the team off updating.

Eventually, once the destination code was all in a single repo, the destinations were merged into a single service, leading to immediate improvements in developer productivity - with service deployment times down to minutes.

"The transition took a bit of time understandably because we had to rework the biggest part of our infrastructure," Noonan tells Computerworld UK. "But, I don't think we've been paged - had to scale it up since - which has been amazing for everybody's sleep. Maintenance has become significantly easier with everything living in one repo.

"Now, if we want to make an update to these shared libraries on how they all behave, it takes one engineer one hour to test it, deploy it to every single one, which has been an absolute gamechanger for us.

"I think when we first made the switch it made a lot of sense, it was a good fit for where the team was at the time, and some of the performance issues we were dealing with then. But as time went on the benefits of the original switch started to reverse a bit and took a big toll on our productivity and performance so we moved back."

Although Segment will be somewhat unique in the sheer volume of data it organises, Noonan says that if other companies experienced similar pains she "wouldn't be surprised" if they moved back to monolithic structures as well, at least in certain cases.

It wasn't just the work-life balance of Segment's employees that improved. French-Owen says that the difference for customers is in the things they "don't notice now".

"Now that everything is all in one place with one single repo, we're able to make the change once and then ensure that the next time those are deployed it goes out everywhere, so I think customers see a lot less of these small inconsistencies that could come from drifting versions between different services."


Copyright © 2018 IDG Communications, Inc.
