Skip the navigation

Here's how Google and Amazon scale up their huge IT operations

As large IT systems scale to unforeseen levels of complexity, new laws of effective management come into play

By Joab Jackson
November 27, 2013 04:24 PM ET

IDG News Service - Internet giants such as Google and Amazon run IT operations that are far larger than most enterprises even dream of, but lessons they learn from managing those humongous systems can benefit others in the industry.

At a few conferences in recent weeks, engineers from Google and Amazon revealed some of the secrets they use to scale their systems with a minimum of administrative headache.

At the Usenix LISA (Large Installation Systems Administration) conference in Washington, Google site reliability engineer Todd Underwood highlighted one of the company's imperatives that may be surprising: frugality.

"A lot of what Google does is about being super-cheap," he told an audience of systems administrators.

Google is forced to maniacally control costs because it has learned that "anything that scales with demand is a disaster if you are not cheap about it."

As a service grows more popular, its costs must grow in a "sub-linear" fashion, he said.

"Add a million users, you really have to add less than a 1,000 quanta of whatever expense you are incurring," Underwood said. A "quanta" of expense could be people's time, compute resources, or power.

That thinking is behind Google's efforts not to purchase off-the-shelf routing equipment from companies such as Cisco or Juniper. Google would need so many ports that it's more cost-effective to build its own, Underwood said.

He refuted the idea that the challenges Google faces are unique to a company of its size. For one, Google is composed of many smaller services, such as Gmail and Google+.

"The scale of all of Google is not what most application developers inside of Google deal with. They run these things that are comprehensible to each and every one of you," he told the audience.

Another technique Google employs is to automate everything possible. "We're doing too much of the machines' work for them," he said.

Ideally, an organization should get rid of its system administration altogether, and just build and innovate on existing services offered by others, Underwood said, though he admitted that's not feasible yet.

Underwood, who has a flair for the dramatic, stated: "I think system administration is over, and I think we should stop doing it. It's mostly a bad idea that was necessary for a long time but I think it has become a crutch."

Google's biggest competitor is not Bing or Apple or Facebook. Rather, it is itself, he said. The company's engineers aim to make its products as reliable as possible, but that's not their sole task. If a product is too reliable -- which is to say, beyond the five 9's of reliability (99.999 percent) -- then that service is "wasting money" in the company's eyes.

Reprinted with permission from IDG.net. Story copyright 2014 International Data Group. All rights reserved.
Our Commenting Policies
Internet of Things: Get the latest!
Internet of Things

Our new bimonthly Internet of Things newsletter helps you keep pace with the rapidly evolving technologies, trends and developments related to the IoT. Subscribe now and stay up to date!