What, IT Worry?

Last Tuesday, Craigslist vanished from the Internet. So did LiveJournal and Technorati. CNET.com and Second Life were reportedly gone for a while too. What happened? The data center they all shared went dark because of a power failure. Simple enough, right? Except that the main point of using that data center was so they'd never have to worry about power failures.

See, a major marketing feature of 365 Main, the humongous San Francisco collocation facility that failed last week, is that it offers power that just won't quit. When power from the local utility goes out, a bank of ten 3,000-horsepower diesel generators is supposed to kick on automatically and keep running until stable power is restored, for days if necessary.

In fairness to 365 Main, it always worked that way in the past.

But not last week. Early Tuesday afternoon, external electric power started fluctuating wildly. A nearby underground transformer exploded. Power went out for a large section of downtown San Francisco, including the Financial District: up to 50,000 customers in all.

And for reasons that 365 Main is still investigating, some of its backup generators didn't fire up as they should have. It took about 45 minutes for on-site engineers to start the generators manually.

By then, the damage was done for Craigslist, LiveJournal and the others, between 20% and 40% of 365 Main's customers. Their servers went down hard. And instead of the magically continuous service their businesses had counted on, those servers had to be brought back up the hard way, slowly and carefully.

The lucky ones were offline for only a few hours. But even for them, the magic was gone.

It should be gone for the rest of us, too. It's time to accept some hard reality.

Bad things happen. They happen no matter how carefully we plan for them, because we can't plan for everything. They happen no matter who we've paid to take on the job of handling those bad things, no matter how much we've paid, no matter what promises we've been given.

Collocation and outsourcing don't work, at least not if what we expect them to do is solve our business continuity problems.

They won't do that. They can't. We shouldn't expect them to.

In fact, we should assume that they won't, and plan accordingly.

That's true even if a company like 365 Main brags that its power can't go down. It can. Murphy willing, it will. And nothing 365 Main does after the fact can make whole the lost sales, lost customers and lost confidence that come in the wake of that failed boast.

So, is outsourcing always the wrong move? Of course not. Trusting outsourcers: that's the wrong move.

We have to believe they'll do their best. Otherwise, we shouldn't be doing business with them. But we also have to know that they're not perfect, no matter what their brightly colored brochures say.

We can hand off work, but we can't hand off responsibility for our company's IT functions. That's still ours.

Which means we can't outsource sleepless nights. We can't quit developing what-if scenarios and contingency plans. We can't stop looking for ways to backstop our vendors' bulletproof services, just in case a bullet somehow gets through.

When it comes to reliability, worry is good. Trust? Not so much.

One of the 365 Main customers, online retailer RedEnvelope, had the right idea. RedEnvelope maintained a backup data center in Cincinnati to avoid the results of just the sort of problem that struck last week.

But after two years without a glitch in San Francisco, 365 Main issued a press release announcing that RedEnvelope had shuttered the Ohio facility.

That was Tuesday morning. That afternoon, RedEnvelope was offline.

Frank Hayes is Computerworld's senior news columnist. Contact him at frank_hayes@computerworld.com.
