No-Fault Windows

Continuous computing systems offer an alternative to clustering for Windows-based servers that require high availability.

By Robert L. Mitchell

For many enterprise computing systems and applications, downtime isn't an option. For emergency-response organizations, lost time can mean lost lives. In the business world, lost opportunities to take action can mean lost revenue.

And the IT people who must keep computers working in environments like those need special hardware and software, especially if they're running Windows-based applications.

Most of the transaction processing at retail payment processor Lynk Systems Inc. moves through high-end, proprietary fault-tolerant systems from Stratus Technologies Inc. The applications, written in Cobol, work well with Stratus' proprietary Virtual Operating System (VOS). Downtime—even a few minutes—isn't acceptable. "If [the systems] go down, our merchants go down," says Carl Cliche, vice president of support systems at Atlanta-based Lynk. But when the need for a new transaction processing application requiring a Web interface resulted in a design using SQL Server 7 on Windows NT, availability became a major issue.

Despite all the advertising hype about "five nines," or 99.999% uptime on Windows servers, a new server cluster fell short of expectations. "We've had big problems with clustering technology," Cliche says, citing fail-over problems and the need for special cluster-aware applications and scripting. "It was all terribly complicated."
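To put "five nines" in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustration only: the function name is made up for this example, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at a given availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for level in (99.9, 99.99, 99.999):
    budget = downtime_budget_minutes(level)
    print(f"{level}% uptime allows about {budget:.2f} minutes of downtime a year")
```

At 99.999%, the budget works out to a little over five minutes a year, so a single fail-over lasting two minutes would consume more than a third of it.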

So he moved the application to Maynard, Mass.-based Stratus' ftServer 5200, a fault-tolerant system that brings continuous computing technology to Windows 2000 Advanced Server as an alternative to clustering. The ftServer supports up to four hot-pluggable Intel Corp. Pentium III or Xeon processors and splits I/O and compute processing functions into separate, redundant modules that run in lock step to provide uninterrupted computing in the event of a hardware failure. The ftServer sheds Stratus' proprietary VOS—and the typical six-figure price of the high-end systems.

"It looks like a normal NT server. You can use the standard software, the applications don't have any special requirements and it's much easier to manage [than a cluster]," Cliche says.

Continuous computing systems for Windows could be the beginning of a new trend. The market for high-availability systems is growing fast. Sales are expected to jump from $52.5 billion this year to $84.2 billion in 2005—an 18.4% compound annual growth rate, according to market research firm Harvard Research Group Inc. (HRG) in Cambridge, Mass.

In addition, the need for Web-enabled interfaces to critical back-end systems and the increasing importance of Windows-based applications, such as databases and e-mail used for transactions, may drive demand for this new class of Windows-based fault-tolerant computing. "Today, where people exchange purchase orders and documents using e-mail, it's understandable that you would have a requirement for high levels of availability," says Bob Besautelf, an analyst at HRG.

Continuous computing was previously available only on high-end systems like Compaq Computer Corp.'s Himalaya and Stratus' Continuum, which use proprietary hardware and operating systems and can cost from about $70,000 to millions. By contrast, Windows-based systems from Stratus and Boxborough, Mass.-based Marathon Technologies Corp. start at about $20,000, making them an attractive alternative to clustering.

The applications most in need of this type of hardware are those that can't tolerate the short downtime involved in a cluster fail-over, or those run by organizations that lack the on-site expertise to manage a server cluster. For example, the San Diego Fire Department's 911 software tracks the locations of emergency vehicles and automatically dispatches the nearest vehicle to a call. Here, the two minutes required for system fail-over is unacceptable. "Two minutes in the life of someone having a heart attack could be their life span," says Doug Bolton, an information systems analyst at the department. The department installed a Stratus system that cost about $100,000. Installation and setup required about 20 hours, substantially less time than it took to set up the previous clustered system, Bolton says.

For InSight Telecommunications Corp. in Boston, which provides satellite and fiber-optic capacity to broadcasters, the issue is one of maintaining business relationships. InSight relies on its software for resource scheduling and management. Broadcast networks need immediate answers and won't wait while a server reboots, says CEO Keith Buckley. "If NBC News calls and we can't provide the service, they're not going to call again," he says.

InSight uses Dell Computer Corp. PowerEdge servers with Marathon's Endurance fault-tolerant Peripheral Component Interconnect (PCI) card and software. While Stratus sells a complete, integrated system, Marathon's technology requires four off-the-shelf servers—two functioning as compute processors and two as I/O processors—operating in lock step over a dedicated high-speed connection. Marathon relies on a network of resellers to integrate the systems and supports servers from IBM, Hewlett-Packard Co., Compaq and Dell. "The fact that Marathon gave us the opportunity to spec which hardware we wanted was important to us," Buckley says. The tab: just less than $75,000, including $45,800 for the Dell hardware.

Hidden Costs

Marathon and Stratus claim that the total cost of ownership of these systems is less than that of clustered systems, but analysts say IT managers should do their own math. Both vendors cite the expense of cluster-aware software, scripting, fail-over testing, staff training and maintenance costs as reasons to purchase their systems rather than a cluster. Both also have arrangements with Microsoft requiring only one Windows or Exchange Server license per system. For other software, IT managers may have to negotiate terms.

Costs for a fully implemented system can go well beyond the $20,000 to $30,000 starting prices. And Stratus' systems typically include a monitoring service contract that adds up to about 20% of the initial system cost annually. "We're used to that with our mainframe Stratus," says Cliche. In the commodity Windows server market, however, others may opt out, analysts say.
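For IT managers doing their own math, the effect of the service contract can be roughed out as below. This is a simplified sketch, not a vendor pricing model: it assumes a flat annual fee equal to 20% of the purchase price, uses the $20,000 entry price as a sample input, and ignores software, staffing and integration costs. The function name is hypothetical.

```python
def cumulative_cost(initial_cost: float, years: int, service_rate: float = 0.20) -> float:
    """Purchase price plus a flat annual service fee (assumed: a fraction of that price)."""
    annual_fee = initial_cost * service_rate
    return initial_cost + annual_fee * years


for years in (1, 3, 5):
    total = cumulative_cost(20_000, years)
    print(f"After {years} year(s): ${total:,.0f}")
```

Under these assumptions, the contract fees alone match the original purchase price after five years.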

Donna Scott, an analyst at Stamford, Conn.-based Gartner Inc., says that on systems such as Exchange 2000 Server that support active-active clustering, the backup system can be performing other tasks until a failure occurs. That's something you can't do on a continuous computing architecture. "Let's face it, nobody likes idle resources," she says.

While the technology isn't new, its implementation on Windows is still evolving. The San Diego Fire Department has had two failures since installing Stratus' system in June, when the product first shipped. One involved a programmable read-only memory flash update; the other required a firmware replacement. Each time, however, the system continued without interruption. Marathon's product has been available on Windows NT Server since 1997, but with some 1,700 system sales, it's still a very small niche in the high-availability server market.

Although these systems run Windows, the hardware architecture is still proprietary. Stratus implements its own Windows hardware abstraction layer in its ftServers and supports only adapters for which Stratus-hardened device drivers are available. "You can't just pop an extra PCI card in here and there," says Cliche. But Bolton says the trade-off is worthwhile. "By insisting on hardening drivers, I think they're overcoming a lot of the issues that my peers are having with their implementations of Windows 2000," he says. Marathon relies on third-party integrators to choose system components.

While the Stratus and Marathon systems offer their own management software, neither has a Simple Network Management Protocol Management Information Base for interoperation with enterprise network management tools; both firms say they're working on one.

Scalability is another limitation. Marathon's technology doesn't support symmetric multiprocessing (SMP). Stratus' ftServer series offers a limited number of configurations with support for up to a four-processor SMP system. Still, "there's absolutely a place for fault-tolerant Windows systems, because they provide a special functionality that you can't get from a cluster today," says Scott.

"If you look at where these things are installed, it's in those sites where they don't have the infrastructure in place to take advantage of load-balancing and clustering across a front-tier or midtier architecture," says Tom Manter, an analyst at Aberdeen Group Inc. in Boston.

The best applications may be found where the costs of downtime are high, Scott says. "Enterprises always buy what's good enough," she says. And for many applications, a clustered system with a fail-over time measured in minutes may be all that's needed. But at least one user takes a different view. "If this thing works out, I'll look to put [ftServers] in for other [Windows] applications that don't currently run on fault-tolerant servers," says Cliche.

High-Availability Resource Links

  • Computerworld's QuickStudy on fault-tolerant computing.
  • This links to a series of technical articles on Windows 2000 clustering technology.
  • Stratus offers several white papers on its ftServer technology and on high-availability issues.
  • This Web site includes a white paper comparing clustering with Marathon Technologies' continuous computing architecture.

Three Fault-Tolerant Architectures

In a WINDOWS SERVER CLUSTER, the primary system handles all processing, while the fail-over system remains on standby (in an active-active cluster, the standby server may be used to perform other work until needed). The cluster software uses a dedicated "heartbeat" connection between the machines to monitor operations and initiate a fail-over. Devices may share a common SCSI bus for access to redundant storage arrays. When a failure occurs, processing stops while the fail-over system boots the applications and comes online. The typical fail-over time may be two minutes or longer.
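The heartbeat mechanism described above can be illustrated with a small simulation. This is a conceptual sketch, not how any particular cluster product is implemented: the class, the function and the three-missed-beats threshold are all assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HeartbeatMonitor:
    """Standby server's view of the primary's heartbeat (illustrative only)."""
    missed_limit: int = 3      # assumed number of consecutive misses tolerated
    missed: int = 0
    failed_over: bool = False

    def tick(self, heartbeat_received: bool) -> None:
        """Process one heartbeat interval."""
        if self.failed_over:
            return
        if heartbeat_received:
            self.missed = 0    # any heartbeat resets the miss counter
        else:
            self.missed += 1
            if self.missed >= self.missed_limit:
                self.failed_over = True  # standby begins taking over


def intervals_until_failover(beats: List[bool], monitor: HeartbeatMonitor) -> Optional[int]:
    """Return the 1-based interval at which fail-over starts, or None if it never does."""
    for index, beat in enumerate(beats, start=1):
        monitor.tick(beat)
        if monitor.failed_over:
            return index
    return None


# One isolated missed beat is tolerated; three in a row trigger the fail-over.
beats = [True, True, False, True, False, False, False, True]
print(intervals_until_failover(beats, HeartbeatMonitor()))
```

Detecting the failure is only the first step. As the description notes, the standby must still start the applications before it comes online, which is where most of the fail-over time goes.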

MARATHON'S ASSURED AVAILABILITY ARCHITECTURE uses a special PCI card and software to interconnect four servers (two functioning as compute processors and two as I/O processors) over a high-speed bus. Special software and connections allow the systems to run processes in parallel. During a failure, the backup server takes over without an interruption.

STRATUS' FTSERVER DUAL-MODE SYSTEM ARCHITECTURE includes two compute processor modules and two I/O modules in a single server enclosure. A special embedded application-specific integrated circuit connects the modules via a proprietary high-speed backplane. All processes run in parallel for uninterrupted operation during a failure. An optional third processor module maintains fault tolerance in the event of a single processor module failure.
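The idea of redundant modules running the same work in parallel can be shown with a loose software analogy. Real lockstep systems do this in hardware, at a much finer granularity; the classes below are hypothetical and exist only to show why the surviving module can carry on with no restart.

```python
from typing import Callable, List


class ComputeModule:
    """Software stand-in for one redundant compute module (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def execute(self, op: Callable[[int], int], value: int) -> int:
        return op(value)


class LockstepServer:
    """Runs every operation on all healthy modules and returns the agreed result."""

    def __init__(self, modules: List[ComputeModule]) -> None:
        self.modules = modules

    def run(self, op: Callable[[int], int], value: int) -> int:
        results = [m.execute(op, value) for m in self.modules if m.healthy]
        if not results:
            raise RuntimeError("no healthy compute module left")
        if any(r != results[0] for r in results):
            raise RuntimeError("healthy modules disagree")
        return results[0]


def square(n: int) -> int:
    return n * n


server = LockstepServer([ComputeModule("A"), ComputeModule("B")])
outputs = []
for step in (1, 2, 3, 4):
    if step == 3:
        server.modules[0].healthy = False  # simulated hardware fault in module A
    outputs.append(server.run(square, step))

print(outputs)  # [1, 4, 9, 16]
```

Because both modules were already doing the same work, losing one changes nothing in the output stream. In a cluster, by contrast, the standby has to start the applications before it can take over.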
