All systems down

To fix the problem, the CAP team decided to put a Cisco 6509 router between the core network and PACS, eliminating spanning tree protocol and its seven-hop limitation. (The 6509 also has switching capabilities, so the team decided to kill three switches inside PACS and use the 6509 for that too.)

Soon after 9 p.m., a Boeing 747 with a Cisco 6509 on board left Mineta International Airport in San Jose bound for Boston's Logan International Airport.

The local CAP team spent the night rebuilding the PACS network, a feat Halamka talks about with a fair bit of awe: The first time around, PACS took six months to build.

After working through the night, the team was momentarily disheartened Friday morning to see that, despite PACS being routed, the network was still saturated. But they rebooted Libby030 and another core switch, which brought out the smiles.

"We rebooted and things looked pretty," Halamka says.

Friday: Back to paper

By 8 a.m., the network started to flap again.

At 10 a.m., Halamka and COO Epstein decided to shut down the network and run the hospital on paper. The decision turned out to be liberating.

"We needed to stop bothering the devil out of the IT team," says Epstein.

Shutting down the network also freed Sands and the hospital's clinicians. Some had already given up on the computers but felt guilty about it. But "once the declaration came that we were shutting down the network, we felt absolved of our guilt," Sands recalls.

The first job in adapting to paper is to find it: prescription forms, lab request forms. They had been tucked away and forgotten. And many of the newer interns had never used them before. On Friday, they were taught how to write prescriptions. When Sands had to write one, it was his first in 10 years at CareGroup. "When I do this on computer, it checks for allergy complications and makes sure I prescribe the correct dosage and refill period. It prints out educational materials for the patient. I remember being scared. Forcing myself to write slowly and legibly."

At noon, Epstein came in to lend a hand ... and walked into 1978. Epstein worked the copier, then sorted a three-inch stack of microbiology reports and handed them to runners who took them to patients' rooms where they were left for doctors. (There were about 450 patients at the hospital.)

In time, the chaos gave way to a loosely defined routine, which was slower than normal and far more harried. The pre-IT generation, Sands says, adapted quickly. For the IT generation, himself included, it was an unnerving transition. He was reminded of a short story by the English author E.M. Forster, "The Machine Stops," about a world that depends upon an über-computer to sustain human life. Eventually, those who designed the computer die and no one is left who knows how it works.

"We depend upon the network, but we also take it for granted," Sands says. "It's a credit to Halamka that we operate with a mind-set that the computers never go down. And that we put more and more critical demands on the systems. Then there's a disaster. And you turn around and say, Oh my God."

Halamka had become an ad hoc communications officer for anyone looking for information. Halamka was the hub of a wheel with spokes coming in to him from everywhere—the CAP team, executive staff, clinicians and the outlying hospitals. Halamka leaned on his emergency room training at the Harbor-UCLA Medical Center in Los Angeles during the height of gang violence in the '90s. Rule one: Stay calm and friendly.

"But I'll be honest, 48 hours into this, with no sleep, the network's still flapping, I had a brave face on, but I was feeling the effects," Halamka recalls. "I was feeling the limitations of being a human being. You need sleep, downtime. You need to think about shifts, or humans will despair."

He found himself dealing with logistics that had never occurred to him: Where do we get beds for a 100-person crisis team? How do we feed everyone? He improvised.

"You don't know the details you're missing in your disaster recovery plan until you're dealing with a disaster," he says. For example, the paper plan was, in essence, the Y2K plan. Besides the fact that it was dated, it didn't address this kind of disaster.

Recovery plans are usually centered on lost data or having backups for lost data, or the integrity of data. At Beth Israel Deaconess, the data was intact. It was just inaccessible.

That led to Halamka's chief revelation: You can't treat your network like a utility.

"I was focusing on the data center. And storage growth. After 9/11, it was backup and continuance. We took the plumbing for granted. We manage the life cycle of PCs, but who thinks about the life cycle of a switch?"

This is a valuable insight. Networks indeed have gotten less attention than applications. But at the same time, Callisma's Rusch says, he hadn't seen a network as archaic as Beth Israel's in several years. "Many have already gotten away from that 1996 all-switched model," he says. "There are probably a couple of others like this out there."

Others agree with Rusch's assessment. "I think the danger is people start thinking the whole health-care IT industry is flawed and a train wreck waiting to happen," says the CIO of another hospital. "It's not. We all watched the heroic effort they made over there, but we're not standing around the water cooler talking about how nervous we are this will happen to us. We've had these issues. They scared us enough a few years ago that we took care of the architecture problem."

Halamka retreated to his office late Friday night. He lay down on the floor, pager in one hand, cell phone in the other, and fell asleep for the first time in two days. Two hours later, his cell phone rang.

Saturday: Helplessly hoping

Half awake, Halamka heard a staffer tell him they had found two more spanning tree errors, one at a facility called Research North and one in cardiology. Both had eight hops, one too many. They planned to cut the redundant links and move the traffic to the core network.
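The check the staffer describes, finding switches that sit more than seven hops out, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the topology, the switch names and the choice of "core" as the root are hypothetical, and hops are counted as links on the shortest path from the root, which may not match how the CAP team measured them.

```python
from collections import deque

MAX_HOPS = 7  # the seven-hop spanning tree limit cited in the article


def hops_from_root(links, root):
    """Breadth-first search: fewest hops from the root to each reachable switch."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    hops = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops


def over_limit(links, root, max_hops=MAX_HOPS):
    """Return the switches whose hop count from the root exceeds max_hops."""
    hops = hops_from_root(links, root)
    return {switch: h for switch, h in hops.items() if h > max_hops}


# Hypothetical chain of switches: core - s1 - s2 - ... - s8
chain = [("core", "s1")] + [(f"s{i}", f"s{i + 1}") for i in range(1, 8)]
violations = over_limit(chain, "core")
print(violations)  # s8 is eight hops out, one too many
```

A check like this depends on an accurate map of the links, which is one reason up-to-date network diagrams matter.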

No one knew for sure how severely this would tax poor Libby030 and its counterparts. The team decided to build a redundant core with routing infrastructure as a contingency plan that would bring CareGroup out of 1996 and into 2002 in terms of its network.

At 8 a.m., two more Cisco 6509 routers (with switching capabilities) arrived from San Jose. Three hours before that, a trio of Cisco engineers from Raleigh, N.C., landed in Boston. They spent all day building a redundant network core.

Sands felt uncomfortable doing rounds that morning. "Patients sort of expect you to know their histories," he says. "But without that dashboard of information I'd get from the computer, I had to walk up to patients I had treated before and ask basic questions like, What allergies do you have? Even if I thought I remembered, I didn't trust my memory. It was embarrassing, and I was worried."

Progress on the network was slow. No one wanted to suggest that the current tack—building the redundant network core while severing the redundant links—was definitely the answer. At 9 a.m., Halamka met with senior management, including CareGroup CEO Paul Levy. "I can't tell you when we'll be functioning again," Halamka confessed.

Admitting helplessness is not part of Halamka's profile. "You never catch John saying, I'm scared, or I messed up," says one of his peers from the Health Data Consortium. "This had to be hard for him."

"When John told us he couldn't tell us when we'd be up, we stopped having him as part of our twice-a-day reports," Epstein recalls. The intent was to free Halamka from his communications duties so that he could focus on the problem. But Epstein was also becoming frustrated. He recalls thinking that "we didn't want to keep sending out memos to the staff that said, Just kidding, we're still down."

"If I had felt, in the heat of the battle, that someone could have done a better job than me, if I felt like I was a lesion, then I would have stepped aside," Halamka says. "At no time did I think this, and at no time was I fearful for my job. Am I personally accountable for this whole episode? Sure. Absolutely. Does that cause emotional despair? Sure. But I had to fix it."

Saturday night, with the redundant core in place, Halamka turned on the network. It hummed. There was clapping and cheering and backslapping among the team, which had grown to 100. Halamka passed around bottles of Domaine Chandon champagne that his wife had bought at Costco. Then he went home.

At 1 a.m., his pager woke him.

Another CPU spike.

Sunday: And on the fifth day, Halamka rested

The problem was simple: A bad network card in RCB, one of the core switches. They replaced the card. Halamka went back to sleep.

Beep. 6 a.m. This time, it was a memory leak in one of the core switches. The CAP team quickly determined the cause: buggy firmware, an arcane VLAN configuration issue. They fixed it.

All day, the team documented changes. Halamka refused to say the network was back, even though it was performing well. "Let us not trust anyone's opinion on this," he recalls thinking. "Let us trust the network to tell us it's fine by going 24 hours without a CPU spike."
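Halamka's test, 24 hours without a CPU spike, is simple enough to write down. The Python sketch below is illustrative only: the article does not say what load counts as a spike, so the 90 percent threshold is an assumption, as are the function name and the sample format. It also assumes the samples have no gaps.

```python
from datetime import datetime, timedelta

SPIKE_THRESHOLD = 90.0             # percent CPU; assumed, not from the article
QUIET_PERIOD = timedelta(hours=24)  # the 24 spike-free hours Halamka asked for


def network_is_stable(samples, now, threshold=SPIKE_THRESHOLD, quiet=QUIET_PERIOD):
    """samples: (datetime, cpu_percent) pairs in any order.

    True only if the history covers the full quiet period and no sample
    at or above the threshold falls within it.
    """
    samples = sorted(samples)
    if not samples or now - samples[0][0] < quiet:
        return False  # not enough history to judge
    spikes = [t for t, cpu in samples if cpu >= threshold]
    return not spikes or now - max(spikes) >= quiet


# Hypothetical hourly readings over 30 hours, with one spike at hour 10
start = datetime(2000, 1, 1)
readings = [(start + timedelta(hours=h), 20.0) for h in range(31)]
readings[10] = (start + timedelta(hours=10), 99.0)

print(network_is_stable(readings, start + timedelta(hours=30)))  # spike was 20 hours ago
print(network_is_stable(readings, start + timedelta(hours=34)))  # spike was 24 hours ago
```

The point of the rule is the same as in the quote: the verdict comes from the measurements, not from anyone's opinion.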

Monday: Back in business

Halamka arrived at his office at 4 a.m., nervous. He launched an application that let him watch the CPU load on the network. It reads like a seismograph. Steep, spiky lines are bad, and the closer together they are, the nastier the congestion. At one point on Thursday, the network had been so burdened that the lines had congealed into thick bars.

Around 7:30 a.m., as the hospital swung into gear, Halamka stared at the graph, half expecting to see the steep, spiky lines.

They never came. At noon, Halamka declared "business as usual." The crisis was over. It ended without fanfare, Halamka alone in his office. The same way it had started.

Taking Action

Beth Israel Deaconess CIO John Halamka learned two critical lessons from his four-day disaster.

Lesson 1

Treat the network as a utility at your own peril.

Actions taken:

1. Retire legacy network gear faster and create overall life cycle management for networking gear.

2. Demand review and testing of network changes before implementing.

3. Document all changes, including keeping up-to-date physical and logical network diagrams.

4. Make network changes only between 2 a.m. and 5 a.m. on weekends.

Lesson 2

A disaster plan never addresses all the details of a disaster.

Actions taken:

1. Plan team logistics such as eating and sleeping arrangements as well as shift assignments.

2. Communicate realistically—even well-intentioned optimism can lead to frustration in a crisis.

3. Prepare baseline, "if all else fails" backup, such as modems to query a network and a paper plan.

4. Focus disaster plans on the network, not just on the integrity of data.

This story, "All systems down," was originally published by CIO.

Copyright © 2003 IDG Communications, Inc.
