Ready for Trouble?

Faced with potential catastrophe caused by anything from the weather to a malicious attack, companies need to make sure their disaster recovery plans match best practices.

It was the Monday morning after the July 4th weekend. The power went out in the highest building in Philadelphia. Not to worry, the disaster recovery (DR) specialists had that one covered—the building had a connection to a separate part of the grid. But then the repair crew accidentally severed the backup connection.

Image Credit: Polly Becker

"Every disaster has a different face, so no one can accurately predict," says Nick Voutsakis, chief technology officer at Glenmede Trust Co., a wealth management firm whose headquarters occupies four floors of that building in Philly. "Your planning has to be flexible enough to cope."

Incidents like this one give businesses a chance to see their DR technology in action. While some companies pass with flying colors, the plans of others are exposed as incomplete, unrealistic and technologically flawed. So, what are the tried-and-true best practices, what technologies should be deployed, and how should IT cooperate with the organization as a whole in order to take all necessary precautions?

"Those companies with untested or poorly tested plans will eventually discover that they aren't as protected as they thought they were," says Mike Karp, an analyst at Enterprise Management Associates Inc. in Boulder, Colo.

Planning for the Unplanned

Some DR plans are too simplistic, don't mesh with the real world and have little value in an emergency. Others are complex tomes that nobody reads. According to Voutsakis, the trick is finding a balance.

But even companies with well-compiled plans can look foolish if nobody can find the plan when they need it. It's no good if it's lost in a binder or in a PC that's down because of the disaster. So keep copies of the plan in multiple locations.

"We include copies of our plan in the emergency packs we provide to employees containing food, medical supplies, flashlights and so on," says Voutsakis.

Glenmede is primarily a Windows 2000/XP shop that uses Cisco Systems Inc. switches and Dell Inc. servers and desktops. Its DR plan has several layers, depending on the situation. If people can't get to work because of excessive snow, the servers keep running at headquarters and the staff works securely from home. If the building's power goes out, the critical systems can be brought up within four hours at a "hot site" across town owned by business continuity services and outsourcing provider SunGard Availability Services Inc., a unit of SunGard Data Systems Inc. If an event keeps employees out of the building for a week, desktops for key personnel are standing by at SunGard.
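A layered plan like this one can be written down as data, which makes it easier to keep copies in several places and to check a drill against it. The following is only a sketch in Python: the tier descriptions paraphrase the paragraph above, while the dictionary keys, the `Tier` class and the `met_target` helper are hypothetical names invented for illustration, not anything Glenmede is known to use.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    trigger: str                          # what went wrong
    response: str                         # what the plan calls for
    recovery_target: Optional[timedelta]  # None where the article gives no figure


# Tiers paraphrased from the article; the keys are hypothetical labels.
TIERS = {
    "staff_cannot_commute": Tier(
        "Staff can't get to work (for example, excessive snow)",
        "Servers keep running at headquarters; staff work securely from home",
        None,
    ),
    "building_power_loss": Tier(
        "The building's power goes out",
        "Bring critical systems up at the hot site across town",
        timedelta(hours=4),
    ),
    "extended_building_loss": Tier(
        "Employees are kept out of the building for a week",
        "Key personnel use the desktops standing by at the recovery site",
        None,
    ),
}


def met_target(tier: Tier, elapsed: timedelta) -> Optional[bool]:
    """Return True/False if the tier has a recovery target, else None."""
    if tier.recovery_target is None:
        return None
    return elapsed <= tier.recovery_target
```

Calling `met_target(TIERS["building_power_loss"], elapsed)` with the time measured in a drill gives a simple pass/fail against the four-hour figure; tiers without a stated target return `None` rather than a guess.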

During the Independence Day weekend outage, Glenmede's management declared an emergency at 7:30 a.m. Since all data is replicated to the hot site, the company had all systems running by 11:30 a.m. But it takes a well-oiled machine to pull that off smoothly. And that means teamwork.

"Form a business continuity program with a dedicated team of two to five people, with a senior management sponsor," advises Roberta Witty, an analyst at Gartner Inc. in Stamford, Conn.

Glenmede's primary DR committee consists of the CTO, the heads of office services and risk management, and an IT audit member. The committee appointed an extended business continuity group consisting of representatives of 20 business units. These people are trained in business continuity, write the plans and collaborate with their business units. The minutes of both committees' sessions are sent to Glenmede's board of directors.

Each business unit has to evaluate its processes and needs. At The Members Group Inc., a West Des Moines, Iowa-based company that provides card-processing and mortgage services to credit unions, the necessary recovery period varied widely by department and time of the month. Payroll, for instance, might be happy with a 13-day recovery window at the start of the payroll period and a 30-minute recovery on payday.
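The payroll example suggests that a recovery window is better treated as a function of the calendar than as a single number. Here is a minimal Python sketch of that idea. The two limits come from the paragraph above; the rule connecting them (the window is simply the time left until the next payroll run) is an assumption made for illustration, and the function name is hypothetical.

```python
from datetime import datetime, timedelta

MIN_WINDOW = timedelta(minutes=30)  # payday figure quoted in the article
MAX_WINDOW = timedelta(days=13)     # start-of-period figure quoted in the article


def payroll_recovery_window(now: datetime, next_payday: datetime) -> timedelta:
    """Tolerable downtime for payroll at `now`.

    Assumption (not from the article): the window equals the time left
    until the payroll run, clamped between the two quoted figures.
    """
    remaining = next_payday - now
    return max(MIN_WINDOW, min(MAX_WINDOW, remaining))
```

Under this assumed rule, an outage two days before payday leaves a two-day window, while one on payday itself leaves 30 minutes. A real department would supply its own rule, which is exactly why IT has to ask.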

"You have to work with the business units to fully understand the drivers of each application," says Jeff Russell, CIO at The Members Group. It's impossible for a lone IT staffer to appreciate the particular needs of each department. The Members Group uses StoneFly Replicator, an IP storage-area network-based asynchronous disaster recovery product from San Diego-based StoneFly Networks Inc. to maintain a mirror image of critical data at a remote location.

State-of-the-Art Technology

While opinions vary as to what constitutes state-of-the-art technology, experts such as Karp of Enterprise Management Associates and Chip Nickolett, a disaster recovery specialist at Comprehensive Consulting Solutions Inc. in Brookfield, Wis., agree that clustering, SAN mirroring and replication are on the leading edge. However, they warn that these can be expensive technologies.

Among operating systems, OpenVMS and Unix seem to be favored more than others. Alpha/OpenVMS, for example, has built-in clustering technology that many companies use to mirror data between sites. Many financial institutions, including Commerzbank, the International Securities Exchange and Deutsche Borse AG, rely on VMS-based mirroring to protect their heavy-duty transaction-processing systems.

Deutsche Borse, a German exchange for stocks and derivatives, has deployed an OpenVMS cluster over two sites situated 5 kilometers apart. It also uses Fibre Channel switches from San Jose-based Brocade Communications Systems Inc. and Cisco switches and routers in its network to ensure high availability.

"DR is not about cold or warm backups, it's about having your data active and online no matter what," says Michael Gruth, head of systems and network support at Deutsche Borse. "That requires cluster technology which is online at both sites."

For its part, Windows has as many detractors as advocates. "While we've never failed to recover a Unix system, it's a different story with Windows," says Nickolett. "Common problems include failed restores, software conflicts and issues with patches or service packs."

Forbes.com Inc. in New York also favors platforms besides Windows. Each business day, it publishes more than 1,500 articles online, making heavy use of an advertising workflow system running on an Intel/Linux platform and a content management system hosted on high-end Fujitsu Ltd. servers that run Sun Solaris. Both are protected using the Continuous Protection System, an appliance from Revivio Inc. in Lexington, Mass. A Gigabit Ethernet line connects to a data center at an unspecified location using host-based mirroring technology. "We're able to switch to the appliance in the event that the primary system has a problem," says Michael Smith, general manager of operations at Forbes.com.

But not everyone agrees that Windows should be avoided. In fact, the Cancer Therapy & Research Center (CTRC) in San Antonio stakes its patients' lives on a combination of Microsoft Corp., EMC Corp. and Cisco tools for host-based mirroring. At the medical center, 21 servers—primarily Windows 2000/2003, plus a few Linux boxes—store data on an EMC Clariion FC4700 array. Two Cisco SN 5428 iSCSI routers and a Cisco MDS 9506 switch mirror data and large imaging files over a Gigabit Ethernet network to another Clariion array at the research center 22 miles away. According to Mike Luter, CTO at CTRC, it takes 10 minutes to recover a downed server and restore service.

"Business continuity is far more important to us than disaster recovery," says Luter. "We want our applications always available to our patients. If we lost the building, it would take a lot more than a few computer systems to be able to treat our patients elsewhere."

Testing Times

The finest technology and the most skillful planning are about as far as many companies go in DR, and that's nowhere near far enough. It takes testing galore to prepare for the real thing. "Failing to follow through with exercises to locate and correct plan deficiencies is a common error," says John Glenn, a business continuity consultant in Clearwater, Fla.

That doesn't mean an IT administrator "dummy-running" the plan over the weekend on his own, Glenn says. You should bring all systems down on a Sunday to see if the remote site operates as planned. And bring in a few dozen employees and run a live test to see how the business units are affected. Can finance continue accounting, sales keep selling and production continue to turn out products? In addition, surprise everyone with a few random exercises during the workweek, suggests Smith of Forbes.com.

"We test our entire plan seven times a year," says Glenmede's Voutsakis. "We evaluate our performance for different levels of disaster and various kinds of events, including sending staff home to see how well they can perform there." He says that the problems that can cripple you during an actual disaster show up only during real-world exercises.

That was the case at The Members Group. It thought it had plenty of bandwidth to replicate off-site. But its T1 lines proved inadequate. For example, its SQL database couldn't be adequately replicated because of bandwidth constraints, so it hasn't been transferred to the IP SAN. Similarly, more than half of the company's servers remain unmirrored. "We're moving our primary facility in May and will add more bandwidth at that time," says Russell.
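A back-of-the-envelope calculation shows how quickly T1 capacity runs out. The Python sketch below uses the nominal T1 line rate of 1.544Mbit/sec. Everything else is assumed for illustration: the article does not say how many lines The Members Group had or how much data it needed to move, and the `efficiency` factor is a rough allowance for protocol overhead.

```python
T1_BITS_PER_SECOND = 1_544_000  # nominal T1 line rate


def hours_to_replicate(gigabytes: float, lines: int = 1, efficiency: float = 0.8) -> float:
    """Hours needed to send `gigabytes` (decimal GB) over `lines` T1 circuits.

    `efficiency` is an assumed fraction of the line rate left after overhead.
    """
    if gigabytes < 0 or lines < 1 or not 0 < efficiency <= 1:
        raise ValueError("invalid input")
    usable_bytes_per_second = T1_BITS_PER_SECOND * lines * efficiency / 8
    return gigabytes * 1e9 / usable_bytes_per_second / 3600
```

Even at full line rate, a single T1 carries only about 16.7GB in 24 hours, so a database whose daily changes approach that volume cannot be kept current over one line, whatever the replication product.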

A CRISIS MANAGEMENT PLANNING GRID

Mission-critical applications
Focus: Site or component outage (external)
Deliverable: Disaster recovery plan
Sample event(s): Fire at the data center; critical server failure
Sample solution: Recovery site in a different location

Mission-critical business processing (work space)
Focus: Site outage (external)
Deliverable: Business recovery plan
Sample event(s): Electrical outage in the building
Sample solution: Recovery site in a different power grid

Business process work-arounds
Focus: Application outage (internal)
Deliverable: Alternate processing plan
Sample event(s): Credit-authorization system is down
Sample solution: Manual procedure

External event
Focus: External event forcing change to internal process
Deliverable: Business contingency plan
Sample event(s): Main supplier can’t ship due to its own problem
Sample solution: 25% backup of vital products; backup supplier

Source: Gartner Inc., Stamford, Conn.

Copyright © 2005 IDG Communications, Inc.
