Navigating The Disaster Recovery Maze

Backup tapes, hot sites, annual tests -- the elements of yesterday's disaster recovery planning may lead to a dead end today. Do you know where your applications are?

Ah, the good old days. Planning for disaster recovery, if it occurred at all, was one of the easier things an IT manager had to do.

You'd back up your mainframe to tape every night or over the weekend. If you were really conscientious, you'd send the tapes off-site and arrange for contingency processing at some other data center. Testing your recovery plan? You'd retrieve the tapes and see if you could read them.

Of course, things have gotten steadily more complicated over the years, with distributed and networked computers, n-tier computing, heterogeneous hardware and operating systems, virtualization, automated data feeds from external parties and more.

Adding to the confusion has been a steady change in the meaning of "disaster." Ten years ago, a four-hour outage might not even have been noticed by users or customers; today, it could cost you your job.

As a result, it has become vastly more difficult to prepare and test disaster recovery plans, and increasingly unlikely that you will go to bed at night feeling 100% sure that all your IT assets are protected.

Companies are dealing with these challenges in various ways. Some are reaching out to external parties for help with disaster recovery planning and hot sites, to which computer processing can be moved quickly in an emergency. Others have pulled back from these arrangements, saying they can better handle the complexity of disaster recovery in-house. Still others are essentially redefining disaster recovery by substituting notions of disaster avoidance.

Jerry Grochow, CIO at MIT, illustrates the problem this way: "I once counted a dozen different boxes that had to be up for [an application] to work from end to end, and that's not unusual. So you ask your SAP application programmer, 'What's necessary to recover your system?' and you don't necessarily get the full picture, because the programmer doesn't realize that the authentication server needs to be running so someone can even log on, and it's running in a different data center."
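The end-to-end dependency problem Grochow describes can be sketched as a small graph walk. This is only an illustration under stated assumptions: the component names, the dependency map and the data center labels below are hypothetical and do not describe MIT's actual systems.

```python
from collections import deque

# Hypothetical dependency map: each component lists what must be up for it to work.
DEPENDS_ON = {
    "sap_app": ["app_server", "database", "auth_server"],
    "app_server": ["load_balancer"],
    "database": ["storage_array"],
    "auth_server": ["directory_service"],
    "load_balancer": [],
    "storage_array": [],
    "directory_service": [],
}

# Hypothetical locations; note the authentication pieces live in a second data center.
LOCATION = {
    "sap_app": "dc_a",
    "app_server": "dc_a",
    "database": "dc_a",
    "load_balancer": "dc_a",
    "storage_array": "dc_a",
    "auth_server": "dc_b",
    "directory_service": "dc_b",
}


def recovery_set(component, depends_on):
    """Return the component plus everything it depends on, directly or indirectly."""
    seen = {component}
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for dep in depends_on.get(current, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def by_data_center(components, location):
    """Group component names by the data center that hosts them."""
    grouped = {}
    for name in sorted(components):
        grouped.setdefault(location.get(name, "unknown"), []).append(name)
    return grouped
```

Calling `by_data_center(recovery_set("sap_app", DEPENDS_ON), LOCATION)` lists the boxes that must be restored in each data center, including ones the application's own programmer might not think to mention.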

Not only are an organization's IT assets no longer all located in a cozy glass room with a raised floor, they may not even be under the control of the IT department. Grochow recalls an earlier job at a brokerage firm that got automated data feeds from 40 external suppliers, noting that some financial institutions have 100 such connections. "How to recover a major data processing application when you have that many feeds is extremely complicated," he says.

The challenges are legion.

Schneider National Inc. in Green Bay, Wis., at one time contracted with a service provider for a disaster recovery hot site but recently decided to set up its own second data center to serve as a recovery facility. "Ours is a very complex and highly integrated technology environment," says Paul Mueller, vice president of technology services at the trucking company, which has 36 locations in North America. "As complexity has increased, so has the difficulty associated with hot-site recovery."

It proved difficult to accurately replicate Schneider's operating environment at the external facility, Mueller says, and so his semiannual disaster recovery tests were never completely satisfactory. "Invariably, we encountered issues when we executed those tests, such as tapes not being correct," he says. "Our ability to restore was problematic based on the hardware configurations, operating system configurations and so on."

Mueller says he is much more comfortable with his new arrangement, but it came at a stiff price. Schneider's two data centers are connected with redundant fiber-optic cables, redundant telephone systems and dual mainframes backing each other up. "We have invested heavily based on the risk to the enterprise and to the supply chains that we help our customers manage," he says. "But we felt this investment was absolutely the right way to go."

And the investment was not just in facilities. With the help of a consultant, Mueller's staff interviewed 70 business managers and a few key customers. The interviews gleaned estimates of the losses that would result from various types and durations of outages, as well as managers' recovery-time goals.

"When you have that information consolidated into an assessment document and you get to see the aggregate impact to the business of losing your data center, it becomes a very compelling story," Mueller says.

Bob Dowd, CIO at Sonora Quest Laboratories LLC in Tempe, Ariz., says his company can't afford a fully redundant hot site for disaster recovery, but he has taken other steps aimed at avoiding a disaster. Sonora Quest runs medical tests for 20,000 patients every night and gets the results to doctors by early the next morning, so it's not hard to imagine the effect that a prolonged outage in its highly automated processes would have on the business. "We have hardened the computer room and built in all kinds of redundancy, so if one node fails, we have immediate fail-over to another node," Dowd says. The Tempe data center has redundant disks, two network cores and no single points of failure. Plus, it does two backups a day, one to a server and another to tapes that are taken off-site.

Still, Dowd worries about the data center, which sits near the end of a runway at the Phoenix airport. He'd like the safety of a remote backup facility, and he has an idea for getting one on the cheap.

Part of the Tempe data center is devoted to serving as a test environment for the lab's systems -- effectively a scaled-down duplicate of the production environment. If that were moved to Sonora Quest's lab in Tucson, Ariz., it could be used as a backup for Tempe, Dowd reasons. "We'd be using it to save the business, not necessarily doing upgrades," he explains.

Virtual Headaches

Rod Flory, CIO at Lennox International Inc. in Richardson, Texas, says the heating and cooling system company has been rolling out server virtualization software to increase the efficiency and flexibility of its servers. But that has complicated disaster recovery planning, he says.

"With VMware, we are changing our server platforms more frequently -- not adding servers, but changing memory, the number of CPUs in them and so on," Flory says. "So quarter to quarter, our environment looks different, and keeping up with that on the hot site is a challenge."

Flory says he tests his disaster recovery plan religiously once a year, and it's not a trivial effort. "It's a project," he says. "I take five people and set them aside for a few weeks."

The tests run smoothly enough, Flory says, but he's considering involving a disaster recovery firm in a future test. "You look at situations like the bird flu. You are counting on five or six people who know how to execute the plan, but what if they are not available?" he says. "Can your plan be scripted well enough that you could hire a consulting group, give them the book and say, 'Here, execute the plan'?"

And there's another improvement Flory wants to make. Traditionally, Lennox's systems have been centralized at company headquarters, but more recently, functions such as e-mail and computer-aided design have been pushed out to servers at manufacturing sites where there are no disaster recovery capabilities.

But including those remote sites in the centralized plan is not simple, because the sites don't have standard systems. "We are dealing with a legacy of autonomous decision-making," Flory says. "We may have Dell servers at one facility and IBM at the next. So you look at 15 to 20 major facilities, and you realize you don't have a common architecture."

He says Lennox will try to move the remote sites to a more common architecture so the central data center can serve as a hot site for them -- but that could take years.

Meanwhile, MIT is supplementing its two on-campus data centers with two additional leased facilities -- one a few miles away and the other "many, many miles away," says Grochow. But these will not be traditional disaster recovery sites. All four will be in use all the time, with each critical application running in at least two of them. The four centers in total will not have a great deal of excess capacity or redundant equipment, so they will not be prohibitively expensive, Grochow says.

With this setup, the difficulty of testing a disaster recovery plan almost disappears. Because every site is running all the time, and because each critical application is running in more than one place, the plan is essentially tested every day, Grochow says. "The idea is to always be in a fail-soft mode. If you have an architecture that allows certain things to be down, you are never completely out of business," he explains. "But if your architecture has lots of single points of failure, you have to have a very detailed recovery plan."
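The rule that every critical application runs at more than one site lends itself to a simple automated check. The following is a minimal sketch under stated assumptions: the application names, site names and placement map are hypothetical, and one application is deliberately placed at a single site to show what the check reports.

```python
# Hypothetical placement of critical applications across four sites.
PLACEMENT = {
    "email": {"campus_1", "leased_near"},
    "student_records": {"campus_2", "leased_far"},
    "payroll": {"campus_1"},  # deliberately a single point of failure
}


def single_site_apps(placement):
    """Applications running at fewer than two sites."""
    return sorted(app for app, sites in placement.items() if len(sites) < 2)


def apps_down_if_site_fails(placement, failed_site):
    """Applications left with no running site once `failed_site` is lost."""
    return sorted(
        app for app, sites in placement.items() if not (sites - {failed_site})
    )
```

An architecture in the fail-soft mode described above would have `single_site_apps` return an empty list; any application it does report is one that still needs a detailed recovery plan of its own.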

"The concept of disaster recovery as we knew it is changing," Grochow says. "I think we have gotten past the point where you can rely on a third party to provide hot-site recovery, because it has gotten too complicated."