March 17, 2004
(Computerworld)
Security and IT operations often act as if they are at war with each other, with completely opposing goals. You've probably seen it. For instance, security works hard to create a policy to ensure that the organization remains in an acceptable defensive posture, only to have it completely ignored by IT operations. So when the organization gets hit with something like the MS Blast worm, many critical servers are affected. As a remedy, security creates a list of urgent patches to be applied. However, due to the wide variety of server configurations, the patches don't consistently succeed. As a result, IT operations is left with a server, or hundreds of servers, that no longer even boot!
In scenarios like this, the patching cure prescribed by security is worse than the disease. A political blame game can follow, creating an adversarial relationship between security and IT operations. More energy is put into unproductive activities, and business goals are compromised, including the delivery of a stable, available and secure computing infrastructure that fulfills business requirements.
Observed practices for success
I recently participated in two workshops where more than 70 practitioners from high-performing IT organizations shared their experiences on how they achieve and sustain their security and operational objectives. The first workshop, "Auditable Security Controls That Work," I co-chaired with the SANS Institute; the other workshop, "Best in Class Security and Operations Roundtable," I co-chaired with the Software Engineering Institute at Carnegie Mellon University.
This two-part article describes my observations and key findings from these workshops. This first article describes the challenges and solutions common to this group. The second article will explore a working definition of what it means to be a high-performing IT organization and will describe the resulting works in progress.
In the two workshops, three key management practices emerged as common to high-performing security and IT operations organizations: They rigorously enforce the change management processes, they foster a "culture of causality," and they ensure that security adheres to and helps enforce the effective management of change. Each of these practices is described below.
Rigorously enforce the change management processes
Common to all the high-performing organizations we studied is a culture of change management. Why so much rigor around how changes are made? They recognize that change represents the significant majority of risk to IT availability and security. Market research company IDC confirms their intuition, showing that 78% of all downtime can be attributed to changes made internally, by someone with access and authority.
A shining example of a culture of change management is Securities Industry Automation Corp. (SIAC), which runs the infrastructure for the New York Stock Exchange and the American Stock Exchange. To proactively manage operational risk, changes are rigidly made and tested in the hours after the market closes to ensure that it opens on time and runs at optimal performance. "We hold our data center managers accountable that all changes are authorized," says Mike Prospect, vice president of secure financial transaction infrastructure at SIAC. "They must sign off that all changes can be mapped to an authorized work order. If the change isn't authorized, the change is rolled back."
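The sign-off rule Prospect describes, where every change must map to an authorized work order or be rolled back, can be sketched in a few lines of Python. This is only an illustration, not SIAC's actual system: the `DetectedChange` record, the `find_unauthorized` function and the work order IDs are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DetectedChange:
    """A change observed on a server (hypothetical record layout)."""
    server: str
    description: str
    work_order_id: Optional[str]  # None when no work order was recorded


def find_unauthorized(changes, authorized_work_orders):
    """Return the changes that cannot be mapped to an authorized work order.

    These are the candidates for rollback under the rule described above.
    """
    return [c for c in changes if c.work_order_id not in authorized_work_orders]


# Illustrative data only.
authorized = {"WO-1001", "WO-1002"}
observed = [
    DetectedChange("trade-db-01", "applied vendor patch", "WO-1001"),
    DetectedChange("trade-db-01", "edited startup script", None),
    DetectedChange("web-07", "changed firewall rule", "WO-9999"),
]

for change in find_unauthorized(observed, authorized):
    print(f"ROLL BACK: {change.server}: {change.description}")
```

The point of the sketch is that the check is mechanical: a change with no work order, or with one that was never authorized, is flagged the same way, which leaves no room for negotiation after the fact.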
In these high-performing organizations, change management isn't viewed as a rubber-stamping bureaucracy that needlessly consumes valuable time, resources and energy. Instead, change management is a critical component of how work must be done to achieve business goals.
Build a "culture of causality" in problem management processes
The damage caused by uncontrolled change runs deeper than merely causing outages; it also prolongs downtime. Typically, more than 80% of time spent repairing and restoring service after an outage is spent trying to determine exactly what changed. Valuable time is often spent in phone calls to systems administrators asking if they changed anything and randomly making other changes or rebooting servers to see if such moves fix the problem. All these efforts consume time and energy and sometimes make the problem worse!
However, high-performing organizations proactively create a culture of causality that highlights how the majority of problems are self-inflicted. This culture is integrated into the way problems are managed and solved. The Microsoft Operations Framework study validates the difference in behavior with a stunning statistic: It found that customers with the highest service levels rebooted their servers 20 times less frequently than the average customers and also had five times fewer "blue screens of death." I would wager that they also spent five times less effort doing security-related tasks. Clearly, this is desirable. It raises the question, "How does one promote this type of operational behavior?"
High-performing organizations ensure that problem managers have all causal factors at hand when they begin problem diagnosis. For example, when a problem manager starts working on a SQL Server outage, listed in the service ticket are all the authorized change orders for that specific server for the past 30, 60 and 90 days, as well as all the actual changes made on that server. By referencing the service ticket, the problem manager sees any potential causes that might have contributed to the outage. Most of the time, they will be able to diagnose the problem and derive a fix without even logging into the server!
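As a rough sketch of what such a service ticket might pull together, the Python below groups one server's change records into 30-, 60- and 90-day lookback windows. The `ChangeRecord` layout, the `change_history` function and the sample data are assumptions made for illustration; the workshops did not prescribe a specific tool or schema.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class ChangeRecord:
    """One change made to a server (hypothetical record layout)."""
    server: str
    made_on: date
    summary: str
    authorized: bool  # True if backed by an authorized change order


def change_history(records, server, as_of, windows=(30, 60, 90)):
    """Map each lookback window (in days) to that server's changes within it."""
    history = {}
    for days in windows:
        cutoff = as_of - timedelta(days=days)
        history[days] = [
            r for r in records
            if r.server == server and cutoff <= r.made_on <= as_of
        ]
    return history


# Illustrative data only.
today = date(2004, 3, 17)
records = [
    ChangeRecord("sql-01", today - timedelta(days=10), "service pack applied", True),
    ChangeRecord("sql-01", today - timedelta(days=45), "registry key edited", False),
    ChangeRecord("sql-01", today - timedelta(days=100), "disk added", True),
    ChangeRecord("web-07", today - timedelta(days=5), "IIS restarted", True),
]

for days, changes in change_history(records, "sql-01", today).items():
    print(f"Last {days} days: {[c.summary for c in changes]}")
```

Keeping the `authorized` flag on each record lets the problem manager see at a glance which recent changes bypassed the change process, which are often the first suspects.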
Have security adhere to and help enforce change management processes
Often, the most difficult part of managing security is ensuring that security is integrated into all IT operational processes. Clearly, security can't do all the work, especially when it's outnumbered 100 to 1 or more by IT operations. By integrating security teams into change management processes, not only can they become aware of changes, but they can also veto risky changes and insert necessary work into the IT operational work queue.
Conclusion
The practices listed above show how security works best when it's embedded in IT operations, helping ensure consistency of practice and continual reduction of operational variance. These practices help security become integral to the work performed by IT operations. Furthermore, security shares responsibility and accountability in ways that don't undermine the organization's defensive posture. This, in turn, alleviates the problems of resource mismatch and process integration that arise in other organizations.
In the second part of this article, I will describe qualitative and quantitative characteristics of high-performing IT operations and security organizations as they differ from average organizations and describe the follow-on activities of the two workshops.
Do you believe your IT organization is best in class? Are you interested in joining the community of practice? If so, and you're interested in sharing practices and metrics, please e-mail me at genek@tripwire.com or visit the information-sharing Web page at http://www.itpi.org/home/icopl.php