Who do you blame when IT breaks?

Assessing fault in data center incidents may pit internal IT staff against their vendors


Filas believes that "outside forces can make or break the data center just as easily as internal forces." But he also sees risk levels rising, particularly as data centers rely more on external suppliers.

Electrical contractors, for instance, may not understand the specific needs of a data center. "We are frequently questioned on why we provide redundant power to racks," said Filas.

Jeff Pederson, manager of data recovery operations at Kroll Ontrack, looks at the root causes of data loss and sees problems caused by both internal staff and external providers. But, he added, service people attempting to get equipment up and running "tend to cause a lot of the damage we see."

"The sole goal [of some service techs] is to get that equipment working and operational; it is not necessarily to protect the data that the customer has," said Pederson.

Pederson said such attitudes often lead to this complaint from users: "My system works now but my data is all gone."

Data losses and outages are among the worst problems data centers face.

In most years, Uptime members reported about two dozen outages; last year the number declined to seven.

The drop in outages coincided with the lowest level of data center equipment installations since 2008, said Seader. He also credits an improved focus on processes and procedures by the reporting companies.

Emerson's Moshiri cites process and procedural issues as a leading cause of problems, particularly when multiple vendors are involved and a high degree of coordination is needed.

Critical pieces of information, such as power diagrams or even the physical location of equipment, are often out of date or incomplete, said Moshiri.

Maintenance is another issue, said Moshiri.

Facility managers may disregard an OEM's recommendation that maintenance on a particular device be conducted, for instance, twice a year.

Steve Fairfax, president of MTechnology, applies Probabilistic Risk Assessment (PRA), a technique used in the nuclear industry, to IT equipment. His analysis concluded that too much maintenance is a major source of problems.

The PRA approach takes all the known data about individual components and combines it in a mathematical model that represents how the entire system works, whether that system is a nuclear power plant or a data center.
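The core idea can be illustrated with a toy calculation. The sketch below is not Fairfax's actual model; the component names and failure figures are hypothetical, chosen only to show how per-component unavailabilities combine into a system-level number for a series power path and for redundant paths.

```python
# Toy PRA-style reliability arithmetic (illustrative only; figures are invented).

def series_unavailability(unavailabilities):
    """A series system fails if ANY component fails:
    U = 1 - product(1 - u_i)."""
    p_all_up = 1.0
    for u in unavailabilities:
        p_all_up *= (1.0 - u)
    return 1.0 - p_all_up

def parallel_unavailability(unavailabilities):
    """A redundant (parallel) system fails only if ALL paths fail:
    U = product(u_i). Assumes independent failures."""
    p_all_down = 1.0
    for u in unavailabilities:
        p_all_down *= u
    return p_all_down

# Hypothetical annual unavailability per component in one power path.
utility_feed = 0.01
ups = 0.005
pdu = 0.002

single_path = series_unavailability([utility_feed, ups, pdu])
redundant = parallel_unavailability([single_path, single_path])

print(f"single path unavailability:     {single_path:.5f}")
print(f"two redundant paths (A+B feed): {redundant:.8f}")
```

The same framework can then be extended with maintenance-induced failure modes, which is how a model can show excessive maintenance reducing rather than improving overall availability.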

Fairfax says his mathematical models make the case that the amount of maintenance in data centers "is grossly excessive by a factor of 10" and is responsible for a great deal of downtime.

"Messing with perfectly functioning equipment is highly profitable," said Fairfax.

Fairfax said if you want to take data centers to the next level of reliability and have them crash as infrequently as airplanes, "then we have to do the same things that jet airplanes do," and train data center operators in simulators.

It also means developing different maintenance criteria. "More is not always better because when you do maintenance on an airplane that means taking it apart and when you take it apart you can sometimes put it back together wrong," said Fairfax.

Patrick Thibodeau covers SaaS and enterprise applications, outsourcing, government IT policies, data centers and IT workforce issues for Computerworld. Follow Patrick on Twitter at @DCgov, or subscribe to Patrick's RSS feed. His e-mail address is pthibodeau@computerworld.com.

Copyright © 2012 IDG Communications, Inc.
