It's spring -- time we turn the clocks ahead. For us, it's also a reminder that it's time for companies to plan for business continuity.
Why? One client whose IT environment comprised a mix of operating systems had not established a single time source. Systems ran just fine. But when a failure took place, the effects -- though apparent throughout the infrastructure -- couldn't be correlated, because the system clocks weren't synchronized.
As IT staffs continue to patch and modify heterogeneous operating systems, the variety and complexity of error conditions that threaten business continuity multiply. Disaster recovery -- restoring data and returning to a functional state -- is well understood. However, we've found that many companies don't fully understand the complexity of business continuity.
Before the alarm bell rings
The good news is that business continuity and disaster recovery share the same infrastructure. But many companies don't plan or execute effectively because they don't follow operational best practices. The IT Infrastructure Library offers a definitive approach to disaster recovery and business continuity. We frequently bring the following elements to our clients' attention:
1. Establish a service-level agreement.
All disaster recovery and continuity work begins with agreement on what matters most to the business. For example, if access to a trading-floor application is lost for 15 minutes, the financial effect can be tremendous. This agreement forms the basis for service-level agreements (SLAs) about IT performance.
SLAs should be more than availability definitions. The familiar target measured in number of "9s" of availability is often chosen without thought for much more than uptime. In addition to performance and service outages, SLAs must include application updates, release-schedule guarantees, even patch management activity -- all of which factor into systems' continuous operation.
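To make the "9s" concrete, the short sketch below converts an availability target into the downtime it permits per year. It is only an illustration: the function name is ours, and it assumes a 365-day year and counts all downtime equally, whether planned or unplanned.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


# Compare the familiar "two 9s" through "five 9s" targets.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability allows {minutes:.1f} minutes of downtime per year")
```

Seen this way, a target such as 99.9% leaves less than nine hours a year for everything, including the patching and release activity the SLA should also cover.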
Don't oversimplify this aspect of planning. Consider how the components of the application infrastructure fit into the availability scheme: An application instance is supported by the presentation-layer, application-layer and data-layer servers.
In all practicality, an SLA of 100% -- though improbable -- won't mean that every component in the environment requires 100% uptime. Instead, a service-level objective should be defined for each component, relative to other components, so that overall environmental performance delivers the agreed-on service level. Start with the element most crucial to the SLA -- the database, for example -- and factor in other components' performance.
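One way to see how component objectives relate to the overall service level is the rough sketch below. It assumes the three tiers sit in series (the application is up only when every tier is up) and that they fail independently; the tier figures and function names are hypothetical, not measurements from any client.

```python
from math import prod


def serial_availability(component_availabilities):
    """Availability of a chain in which every component must be up at once."""
    return prod(component_availabilities)


def required_per_component(target, n_components):
    """Equal availability each of n serial components needs to reach the target."""
    return target ** (1 / n_components)


# Hypothetical service-level objectives for each tier, as fractions.
tiers = {"presentation": 0.999, "application": 0.999, "data": 0.9995}

overall = serial_availability(tiers.values())
print(f"Overall availability: {overall:.4%}")

needed = required_per_component(0.999, len(tiers))
print(f"Each tier needs about {needed:.4%} to deliver 99.9% overall")
```

Under these assumptions the chain is always less available than its weakest tier, which is why the objectives have to be set relative to one another rather than copied from the SLA.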
2. Identify potential problems with achieving the SLA.
Develop scenarios that outline exactly what could go wrong and what it would take to mitigate it. Then rank these scenarios for probability and cost. Next, prioritize them for executive sign-off. Agreement on projected losses gives a realistic idea of the resources required for continuity.
For organizations just beginning an implementation, the definition of failure scenarios is a chance to set options for creating an application environment to eliminate specific vulnerabilities. Buy-in from executive leadership will lead to a road map for deployment.
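The ranking step can be sketched as a simple expected-loss calculation: multiply each scenario's probability by its projected cost and sort. The scenarios, probabilities and dollar figures below are invented for illustration, and expected annual loss is only one possible ranking rule.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    annual_probability: float  # estimated chance of occurring in a given year
    cost_if_occurs: float      # projected loss in dollars

    @property
    def expected_annual_loss(self) -> float:
        return self.annual_probability * self.cost_if_occurs


def rank(scenarios):
    """Order scenarios from highest to lowest expected annual loss."""
    return sorted(scenarios, key=lambda s: s.expected_annual_loss, reverse=True)


# Hypothetical failure scenarios for executive review.
scenarios = [
    Scenario("failed patch rollout", 0.50, 60_000),
    Scenario("database server failure", 0.20, 500_000),
    Scenario("site-wide power loss", 0.02, 4_000_000),
]

for s in rank(scenarios):
    print(f"{s.name}: ${s.expected_annual_loss:,.0f} expected loss per year")
```

A list like this gives executives a common basis for deciding which scenarios justify mitigation spending.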
3. Perform data classification.
Many clients haven't evaluated what data an application requires and the sensitivity of that data. Data classification reflects data availability requirements and in turn determines storage infrastructure for business continuity. Skipping this detailed but crucial step makes it hard to define costs, easy to overengineer or overbuild application infrastructure -- and easy to overspend.
4. Understand the risk thresholds for different areas of the business.
This insight enables the service desk to make intelligent decisions when, for instance, a server has failed. If the recovery time objective is 30 minutes and it will take 15 minutes to identify a problem, only 15 minutes remain to decide and recover, so it's important to know when your "go/no go" decision must be made.
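The timing can be worked out with simple arithmetic, as in the sketch below. The 30-minute recovery time objective and 15-minute identification time come from the example above; the 10-minute fail-over duration is our assumption, as are the function names.

```python
def go_no_go_deadline(rto_min: float, failover_min: float) -> float:
    """Latest minute after the outage starts at which fail-over can begin
    and still complete within the recovery time objective."""
    return rto_min - failover_min


def repair_window(rto_min: float, detect_min: float, failover_min: float) -> float:
    """Minutes left after the problem is identified to attempt an in-place fix
    before the go/no-go deadline; zero means fail over immediately."""
    return max(0, go_no_go_deadline(rto_min, failover_min) - detect_min)


RTO = 30        # recovery time objective, from the example
DETECT = 15     # time to identify the problem, from the example
FAILOVER = 10   # assumed time for fail-over to complete

print(f"Decide by minute {go_no_go_deadline(RTO, FAILOVER)} of the outage")
print(f"Time available for an in-place fix: {repair_window(RTO, DETECT, FAILOVER)} minutes")
```

With these numbers the service desk has only a few minutes after diagnosis before fail-over must begin, which is the kind of threshold that should be known in advance rather than worked out during the outage.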
5. Develop detailed procedures for each scenario approved.
The failure scenarios selected are the basis for disaster recovery and business continuity planning. They need to be communicated to all architects and developers to ensure consistent approaches to application development and infrastructure. Failure scenarios shine a light on the risks so that all are engaged in mitigation.
6. Test, test, test.
While a "minor" change to the IT environment might not affect recovery, it could have ramifications for successful fail-over. Untested minor changes have an unknown effect on site fail-overs.
A good time to test is with a new release of an application, especially one with business continuity requirements. Testing may reveal new options or the elimination of certain failure scenarios that should be factored into the final release.
The clock is ticking
In essence, disaster recovery and business continuity come down to planning and preparation. With proper planning, business continuity allows people to make smart decisions in compressed time frames, with little information. With adequate preparation, any event can be swiftly dealt with using tried and tested methods. Without either, your company stands to suffer losses that will extend far beyond the measured cost.
Christopher Burry is a technology infrastructure practice director and fellow at Avanade Inc., a Seattle-based integrator for Microsoft Corp. technology that's a joint venture between Accenture Ltd. and Microsoft. David Mancusi is technology infrastructure practice director of Avanade's Eastern Region. Comments or questions can be sent to Christopher.Burry@avanade.com.