Who do you blame when IT breaks?

Assessing fault in data center incidents may pit internal IT staff against their vendors

There's always a reason why things break in IT, and the powers-that-be can usually find someone to blame -- be it a data center operations staff member, an OEM, a systems integrator or a third-party service provider.

An offender often leaves clear fingerprints showing that a component was mislabeled or a process wasn't updated. In other cases, an incident may be the result of oversights by multiple parties.

But with the possible exception of a meteor strike, there's always someone to blame for a data center problem.

The majority are blamed on outside parties such as contractors or vendors, with a sizeable percentage of fault assigned to data center operations staff, according to data compiled by the Uptime Institute.

The findings of the Uptime Institute, which has been collecting incident data from its data center customers since 1994, may draw criticism as few internal IT operators or their vendors take blame easily.

Vendors may be blamed most because they are usually willing to take a bullet for a problem even if they feel the genesis is an internal operations oversight.

"The vendor gets caught up in a sensitive spot," said Ahmad Moshiri, director of power technical support at Emerson Network Power Liebert Services, "because it doesn't want to put the client -- a facilities manager -- in a difficult position. It's very touchy," he said.

Uptime Institute members -- data center managers from multiple industries -- agree to voluntarily report abnormal incidents. The institute has about 5,000 abnormal incidents in its database. Such incidents are defined as any event in which a piece of equipment or infrastructure component did not perform as expected.

The data compiled by Uptime found that 34% of the abnormal incidents in 2009 were attributed to operations staff, followed by 41% in 2010, and 40% last year.

External parties that work on the customer's data center or supply equipment to it, including manufacturers, vendors, factory representatives, installers, integrators and other third parties, were responsible for 50% to 60% of the incidents reported in those years, according to Uptime.

Some 5% to 8% of the incidents each year were tied to things like sabotage, outside fires, other tenants in a shared facility and various odd anomalies.

About 10% of all the reported abnormal incidents resulted in an outage ranging from a system losing power to a data center going out.

The Uptime data shows that internal staff are responsible for a majority (60%) of those incidents, which can include outages and data loss incidents.

Although the internal staff gets the blame, "it's the design, manufacturing, installation processes that leave banana peels behind and the operators who slip and fall on them," said Hank Seader, managing principal research and education at Uptime.

To Seader's point about banana peels, David Filas, a data center engineer at healthcare provider Trinity Health, described a situation where a fire system vendor, performing routine maintenance on a fire suppression system in one data center, triggered an emergency power off (EPO).

Ordinarily, this would not have been a problem, but an error in the construction of the EPO circuit let the signal through, which resulted in an outage. It turned out that the EPO bypass circuit was not constructed to the as-built drawing when the center was built years earlier.

"The designs and actions of engineers, architects, and installation contractors can have latent effects on operations long after construction," said Filas.
