With recent outages, big data service providers take hits

Are recession-induced cutbacks killing reliability?

When eBay Inc.'s PayPal unit suffered an outage Aug. 3, one of the companies affected was Sailrite Enterprises Inc., a sailing supply company in Churubusco, Ind. Sailrite lost its customer payment services for six hours.

The next day, PayPal's services failed again -- this time, for what seemed like an hour, according to Matt Grant, Sailrite's vice president. PayPal, in an e-mail, blamed the outage on a "back-end router" problem that was compounded by a failure in its redundancy measures. Grant was not amused. He posted a blunt message on PayPal's blog: "This is not acceptable."

It's not just PayPal. Other big providers of Internet-based services have seen outages, especially over the past two months. What's uncertain is whether this is a sign of a systemic problem, a run of bad luck or business as usual. Could it be that the reliability of core services is suffering along with the economy? Or is the increasing dependency on external cloud providers, coupled with near-instantaneous Twitter reports of outages, just making what's normal seem worse?

The answers aren't clear, but there are numerous signs of trouble.

Last month, Texas-based Rackspace Hosting Inc. said it would issue up to $3.5 million in service credits to customers after a data center in Dallas was hit by two separate power outages in June and July. Backup generators failed after one outage, and connectivity was lost in the other. Also in July, Google Inc.'s App Engine suffered an outage, with systems exhibiting "elevated latency and error rates." That incident appeared to last about four hours, according to Google's staff updates.

That wasn't Google's only outage this year. In February a "routine maintenance event" in a European data center knocked out Gmail for more than two hours.

The same week PayPal had trouble, a data center operated by a Houston-based hosted services provider known as The Planet was knocked offline for an hour. The problem, explained in a Twitter post by the company, was this: "There was a utility power drop, and the automatic cut-over to our UPS systems did not occur."

Other outages at smaller data centers got public attention, too: a Texas government system had problems; so did a data center in Seattle and hosted services provider Site5.com. More recently, Twitter and other sites were hit by distributed denial-of-service (DDoS) attacks.

Finally, Cisco Systems Inc.'s Web site was down for two hours earlier this month because of human error.

Kurt Roemer, chief security strategist at Citrix Systems Inc., said he sees more evidence of internal mistakes. Of Cisco's Web site outage, for instance, he asked, "Would that have happened a few years ago... when they had multiple people checking every single change?"

If outages are becoming more frequent, the economy may be at fault. The Association for Computer Operations Management (AFCOM) reported in December that half of all data centers it surveyed were planning cuts, and nearly 12% of the survey respondents said they believed service disruptions would increase.

Another warning sign comes from Uptime Institute data. The Santa Fe, N.M.-based data center engineering and consulting firm issues what it calls Flash Reports to its members when it sees a data center experiencing failures that could occur at other sites with the same kind of hardware. That hardware includes circuit breakers, batteries and UPS systems.

In all of 2008, Uptime sent out six Flash Reports, according to Ken Brill, Uptime's executive director. So far this year, it has sent out 17 reports detailing equipment problems, and it has four others pending. Brill isn't sure what's causing the uptick, but he believes it's significant.

The drive for energy efficiency may be prompting data centers to cut back on redundant equipment and run their systems harder, exposing equipment flaws that may have been there all along, Brill said. Cutbacks are another possibility. "We're not doing the maintenance we should be doing, and when you don't do maintenance, you increase the probability of catastrophic failure," he said.

Ted Maulucci, CIO at real estate developer Tridel Corp. in Toronto, doesn't see a systemic problem, even though he had to deal with an outage at a data center provider. He believes fiber-based connectivity is improving performance and stability. "Five years ago, it was not uncommon to experience the odd interruption, whereas today it has been pretty rock solid, other than the major failure that happened," he said.

Neal Puff, the CIO of the Yuma County, Ariz., government, moved ERP systems that had been hosted by an Internet-connected service back in-house last year. His goal: to improve reliability and performance. Puff estimated the county will save $1 million over five years by running the systems itself. Even so, he said he believes that "a well-run hosted solution with reliable connections will perform well and be as reliable as any well-run in-house system."

Leslie Daigle, the chief Internet technology officer of the Internet Society, said there have always been significant issues and outages online. Indeed, there was concern in the mid to late 1990s that congestion would bring the Internet to its knees. But that never happened. "The Internet is in a state of constant evolution, and that really does provide its overall resilience," she said.

There is more investment than ever in reliability and redundancy, "and for the most part it shows," said Jose Nazario, manager of security research at Arbor Networks Inc. in Chelmsford, Mass. "Given the loads that the network takes, its highly dynamic structure day to day and the fragility of its components, it's quite stable -- maybe not yet dial-tone reliable, but pretty good."

Mike Hrabik, chief technology officer at Solutionary Inc., a managed network security provider in Omaha, said he believes that companies need to look more at the risk they may be injecting into systems. "The projects aren't slowing," he said, but the resources needed "to deploy, test and continually monitor are going down."
