Maximizing server uptime: Best practices

Keeping servers up and running requires a mix of careful planning, preventive maintenance and common sense.

In an IT world full of elusive goals, there's probably no target as slippery as server uptime.

Keeping servers alive and awake, or at least ready to instantly spring into action whenever needed, is an ambition close to the heart of virtually all data center leaders.

Yet few managers can honestly say that they are doing absolutely everything to squeeze the most uptime out of their systems. Indeed, many managers needlessly lavish time and funds on technologies and practices that have little or no positive impact on uptime, experts say.

Achieving server uptime excellence is both a science and a management art, says Walter Beddoe, vice president of IT and logistics at Six Telekurs USA, a financial data provider in Stamford, Conn. "It's a combination of many different things, including having a competent staff, using fault-tolerant hardware, adopting dynamic security practices, and embracing good maintenance and change management practices," he says. "Most of all, you must have a commitment to doing your very best."

Alan Howard, IT director at Princeton Radiology, a diagnostic medical imaging firm in Princeton, N.J., urges managers not to waste time and resources on activities and tools that don't directly contribute to uptime enhancement. The effort put into clustering, for example, can be "pretty wasteful," he says, noting that redundancy is better achieved with a tool that provides full automation.

Clustering that is not automated -- where the synchronization is done manually -- can cause more problems than it's worth, Howard says. "A failure of a primary node can cause havoc; we'd have been better off simply recovering from the primary-node failure than failing to the standby node," he says.

For instance, his shop had a Windows Server cluster that, upon failover, would cause the application to crash because a change to an application configuration file had not been applied to the standby server. "The effort to fix the cause of the application crash tended to be much more than the effort to fix the cause of the cluster-node failure," Howard says.

His shop no longer provisions clustered servers in the traditional sense. Instead, he has a "cluster" of stand-alone servers -- all mapped to a dual-controller Compellent Storage Center SAN -- "among which we can migrate virtual machines on demand quite seamlessly."

Getting organized

Most managers agree that carefully planning all server-related work, from acquisition to management to replacement, is a key step in guaranteeing system reliability.

Raoul Gabiam, IT operations and engineering manager at George Washington University, says life-cycle management is an integral part of server uptime planning in his shop. "Knowing when and how to replace hardware and upgrade software is important, as it affects performance, sustainability and overall uptime," he says.

For example, if you have to perform a software upgrade, understanding the hardware requirements and the state of your current hardware is critical. You may want to buy the hardware as part of the software upgrade to ensure that requirements are met and to avoid further outages, or perform one before the other to minimize the number of changes, Gabiam explains.
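Gabiam's point — verify hardware against the new software's requirements before committing to the upgrade — can be expressed as a simple gate in an upgrade runbook. A minimal sketch, where the inventory fields and requirement names (`ram_gb`, `cores`, `disk_gb`) are hypothetical placeholders for whatever a real asset database tracks:

```python
def meets_requirements(host, requirements):
    """Return True if the host's inventory satisfies every stated
    minimum requirement; missing inventory fields count as zero,
    so an incomplete record fails safe."""
    return all(host.get(key, 0) >= minimum
               for key, minimum in requirements.items())

def upgrade_candidates(hosts, requirements):
    """Split a host inventory into (ready, needs_hardware) lists,
    so the hardware purchase can be planned as part of the same
    change rather than discovered mid-upgrade."""
    ready = [h for h in hosts if meets_requirements(h, requirements)]
    short = [h for h in hosts if not meets_requirements(h, requirements)]
    return ready, short
```

The payoff is exactly the trade-off Gabiam describes: the hosts in the second list either get hardware bought as part of the upgrade change, or get their hardware refresh scheduled as a separate, earlier change to keep each change small.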

Gabiam is also a strong believer in standardization and coordination as a way of ensuring reliable server operation. "Before anybody installs anything or makes a change, there has to be a change management process," he says.
