Metapack's IT ops team puts its feet up on Black Friday

The logistics software platform has developed a fine-grained monitoring system, and exposes it back to clients, to build trust and cut down on issues

Any business operating in the retail sector will have surely experienced some nerves around Black Friday and Cyber Monday – especially the thought of outages and downtime during the high-traffic online shopping events.

One such company is Metapack, the London-based software-as-a-service vendor which links retailers like Adidas, ASOS and Tesco with a vast network of logistics firms and fulfilment options via a single API. It was purchased for £175 million and merged with US mail and shipping specialist Stamps.com in July 2018 and now delivers 1.2 million parcels a day, peaking at three to five million on days like Black Friday or Cyber Monday.

Steve Homan, CTO at Metapack, walked into an organisation in 2018 where, he said: "We had some really good people who weren't being listened to. We had senior people who weren't engaged in the detail and were reacting to noise, rather than asking what the failure demand problem is."

With the acquisition process, Homan saw an opportunity to pause any product development and just focus on core operations and reliability engineering.

"So we just focused really hard on basic, quality information, quality engineering principles, doing brilliant basics really well, really consistently and relentlessly focusing on it," he told Computerworld during the New Relic Futurestack event in London last week. "What it allowed us to do was to get really clear on the small, medium and large problems we needed to solve. When you take some of the frenetic activity away you have more time and get better, it is a very virtuous cycle."

Preparing for peak

In practice this started with a renewed focus on quality engineering practices and a fine-grained approach to detail. This manifests most clearly in the new daily ops meeting, where "your worst case scenario is 24 hours away from deep expertise and we make decisions really quickly and go off and operate," Homan said.

Then Metapack wanted to start "preparing for peak" so that it was better equipped to meet customer expectations around key retail dates, from Black Friday all the way up to Christmas.

"Last year we said we are preparing for peak permanently," Homan said. "We are heavily automated and with that model once you have done the hard work you have solved it, you just have to keep going because we deploy constantly, so you can roll back really easily." Next on his agenda was removing or automating away lots of legacy kit, including some .NET stacks.

Now, using a set of New Relic monitoring tools and an increasingly public cloud-based infrastructure, Metapack is better able to respond to incidents and prepare for these peaks. The rest of the stack includes Nagios for system monitoring, and Pingdom for CPU and memory usage. This all now flows through an ops stack that includes PagerDuty, Slack and Zoom.

Lukasz Ciechanowicz, head of technical operations at Metapack, wrote in an online case study: "With [New Relic] APM we can follow the path of sub services and focus in on particular pain points. This has allowed us to tune our application in exactly the right place and get the best out of our infrastructure. Alerts are now so detailed that we can direct them straight to the team best placed to fix them the fastest. It drastically reduces our time to detect and fix issues." Availability has improved from 99.4 to 99.96 percent, he added.
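
To put those availability figures in context, the arithmetic below converts an availability percentage into expected downtime. It is a back-of-the-envelope sketch that assumes a 365-day year of continuous service; the function name and the yearly measurement window are illustrative assumptions, not details from Metapack or New Relic.

```python
# Assumption: availability is measured over a full 365-day year.
HOURS_PER_YEAR = 365 * 24


def downtime_hours(availability_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime implied by an availability percentage."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * period_hours


if __name__ == "__main__":
    # The two availability figures quoted in the article.
    for availability in (99.4, 99.96):
        hours = downtime_hours(availability)
        print(f"{availability}% available: about {hours:.1f} hours down per year")
```

On these assumptions, 99.4 percent corresponds to roughly 52.6 hours of downtime a year, and 99.96 percent to about 3.5 hours.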

Now the operations team at Metapack is literally able to put their feet up on peak days.

For example: "On Black Friday morning a customer did an insanely large global fashion launch," Homan said. "The retailer phoned us, a little bit panicked, and we sent them a picture of a New Relic dashboard on a phone next to a coffee mug, a pair of feet on the desk, like: 'We're cool'."

During these peak periods senior IT staff at Metapack work shifts across a 24-hour clock. Recently Homan took on a key Saturday afternoon shift.

"I watched a football game, I made some music on my laptop, we had a curry in the office," he said. "There's record volumes all over the place and we were in control, that comes from having done the hard yards."

Coming armed with information

Now, Homan is looking to get closer to customers and build that trust in the platforms and metrics.

The New Relic Insights tool was initially procured by Metapack to help it verify and meet various service-level agreements (SLAs) by collecting and delivering key metrics to managers. Now Homan wants to expose these insights straight back to the customer.

"We're confident in what we do, but also respectful of the fact that we sit in the heart of someone's business and it's very, very dangerous if we get it wrong," Homan said. "We have built trust so we are able to have really great, deep, engaged conversations with the operators and the businesses. It's changed the conversations from transactional to more of a partnership, because you take a bit of a risk and there is respect there by showing this stuff."

Metapack tracks its net promoter score (NPS) with customers on an ongoing basis, and Homan claims it has improved from -27 to +30 on his watch.
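
For readers unfamiliar with the metric, NPS is conventionally the percentage of promoters (scores of 9 or 10 on a 0-10 scale) minus the percentage of detractors (scores of 0 to 6), which is why it can be negative. A minimal sketch follows; the survey responses are invented for illustration, and how Metapack collects or weights its own scores is not stated.

```python
def net_promoter_score(scores):
    """Compute NPS from 0-10 survey scores: % promoters minus % detractors."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 or s > 10 for s in scores):
        raise ValueError("scores must be between 0 and 10")
    promoters = sum(1 for s in scores if s >= 9)   # scores of 9 or 10
    detractors = sum(1 for s in scores if s <= 6)  # scores of 0 to 6
    return 100.0 * (promoters - detractors) / len(scores)


if __name__ == "__main__":
    # Hypothetical responses, not Metapack data.
    sample = [10, 10, 9, 7, 3]
    print(f"NPS: {net_promoter_score(sample):+.0f}")
```

Passives (scores of 7 or 8) count towards the total number of responses but towards neither group, so they pull the score towards zero.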

For example, the IT ops team had been seeing an inconsistent slowdown on a key platform. By hunting through the monitoring tools they were able to diagnose the problem – certain configuration triggers that some members of the internal non-technical team were setting up were stacking up in an inefficient way – and go to that team to explain and resolve the issue.

"When we went and had that conversation with them it was amazing how simple it was. So rather than saying, we are production and we have the stick and everyone else gets out the way, we turned up, armed with information, on their terms and that creates the right conversation," Homan said.

AIops and the future

Far from seeing it as an industry buzzword, Homan says he is "obsessed" with the idea of AIops.

In terms of a working definition for AIops, Homan had one ready to go: "AIops means finding things that are fine-grained detail that are outliers and patterned over an extended period of operating time and showing me something I would have struggled to put together. So showing me a regular pattern is something we do today, but showing me a semi-irregular pattern that has occurred over time is very powerful."
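
The simplest building block of that idea is flagging outliers in a metric series. The sketch below uses a modified z-score based on the median absolute deviation (MAD), which is a standard robust statistic and not anything Homan or New Relic describes; the function name, threshold and sample latencies are all illustrative assumptions. The semi-irregular patterns over long periods that Homan has in mind would need considerably more than this.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. A threshold of 3.5 is a common rule of thumb.
    """
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - centre) / mad > threshold
    ]


if __name__ == "__main__":
    # Hypothetical response times in milliseconds, with one obvious spike.
    latencies_ms = [100, 102, 98, 101, 99, 100, 250, 101]
    print(mad_outliers(latencies_ms))
```

Using the median rather than the mean keeps a single large spike from distorting the baseline it is being compared against.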

Now he is busy working out how to achieve it. "We have been talking to the New Relic product team demanding it and saying: 'if you don't get it we will buy it from someone else' because in our game, and I love New Relic, but if I don't get what I need, I will go somewhere else and knit in the next [solution]," he said.

It's this shift towards better, faster root cause analysis that is driving the monitoring sector forward, as can be seen with the investment in machine learning and AIops at rival vendors like AppDynamics and Splunk.

Copyright © 2019 IDG Communications, Inc.
