Practitioners in information technology face pressures on many fronts. In addition to the demands to become more efficient, IT must now address challenges to maintain a secure state and comply with regulatory requirements. For example, the Sarbanes-Oxley Act of 2002 is forcing publicly held U.S. corporations to attest to the fact that internal controls are both in place and effective. IT operational best practices, such as the Information Technology Infrastructure Library (ITIL), provide a framework to start defining repeatable and verifiable IT processes. However, as organizations attempt to use ITIL to begin their journey toward process improvement, they face a very difficult question: How and where do you start?
We have developed a methodology known as "Visible Ops." Since 2000, we have met with hundreds of IT organizations and identified eight high-performing IT groups with the highest service levels, best security and best efficiencies. What was most amazing about them was that they shared the following attributes: a culture of change management, a culture of causality and a culture that fundamentally valued effective and auditable controls, promoting fact-based management. Visible Ops reflects the lessons learned about how these organizations work and describes a control-based entry point into the world of ITIL that others can leverage to springboard their own process improvement efforts.
Stephen Elliot, an analyst at IDC, has reported that, on average, 80% of IT system outages are caused by operator and application errors. This finding motivated us to dig into the causal factors of infrastructure downtime, which repeatedly revealed shortfalls in change management practices. Many organizations had well-documented change management processes that, in reality, no one followed. In many of these cases, the goals and motivations for having change management were not clear to management or to the practitioners themselves. Another key finding was that having a documented change management process was necessary, but far from sufficient, to achieve high-performing characteristics. In the high-performing organizations we studied, change management was embedded in the culture and had a very different meaning than in typical organizations. The Visible Ops Handbook is dedicated to describing those practices that set the high performers apart.
Something Must Need Improvement – Otherwise, Why Read This?
"The most likely way the world will be destroyed, most experts agree, is by accident. That's where we come in; we're computer professionals. We cause accidents." – Nathaniel Borenstein
The motivation for ITIL, change management and overall process improvement is well known. The trade press is full of stories about cost-cutting measures, outsourcing and regulatory requirements from Sarbanes-Oxley, HIPAA (the Health Insurance Portability and Accountability Act of 1996), Basel II, FISMA and so forth. The list of people talking about the problems is already large enough, so we promise to keep the discussion of the problem domain to a minimum. The issues and challenges that need to be addressed include:
- Organizations that have change management processes but view them as overly bureaucratic and a drag on productivity. There must be more to change management than bureaucracy, good intentions and scarcely attended meetings.
- Organizations where, deep down, everyone knows that people circumvent proper processes because crippling outages, finger-pointing and phantom changes run rampant.
- A "cowboy culture," where seemingly "nimble" behavior has promoted destructive side effects. The sense of agility is all too often a delusion.
- A "pager culture," where IT operations believe that true control simply is not possible and that they are doomed to an endless cycle of break/fix triggered by a pager message at late hours of the night.
- An environment where IT operations and security are constantly in a reactive mode, with little ability to figure out how to free themselves from firefighting long enough to invest in any proactive work.
- Organizations where both internal and external auditors are on a crusade to find out whether proper controls exist and to push madly for implementing new ones where they are not in place.
- Organizations where IT understands the need for controls but does not know which controls are needed first.
"It is not enough to show that a situation is bad; it is also necessary to be reasonably certain that the problem has been properly described, fairly certain that the proposed remedy will improve it, and virtually certain that it will not make it worse." – Robert Conquest
We developed the Visible Ops methodology because everyone seemed to be asking the same urgent question: "I believe in the need for IT process improvement, but where do I start?" There were no satisfactory answers. Although ITIL provides a wealth of best practices, it lacks prescriptive guidance: In what order and how should the practices be implemented? Moreover, the ITIL books remain relatively expensive to distribute widely. The third-party information that is publicly available on ITIL still tends to be too general and vague to effectively aid organizations. Visible Ops uses ITIL terminology and is intended to be an "on-ramp" to the rest of the ITIL body of knowledge.
History of Visible Ops
Since early 2000, Gene Kim, CTO of Tripwire Inc., and Kevin Behr, CTO of IP Services, have studied what contributes to the success of high-performing IT organizations. IP Services is a business process outsourcing company that provides business support services, including revenue-generating e-business infrastructure and IT assets.
At IP Services, the IT operations group reports to Kevin, and for years he tried to understand how to best increase service levels and decrease cost to maximize value. Tripwire is a software vendor for a product that detects change – it was originally written by Gene in 1992 as an intrusion-detection technology to help systems administrators recover from the 1988 Morris Internet worm. Gene has spent years trying to understand why Tripwire's largest customers kept insisting that the software was not a security technology, but instead a technology to enforce their change management processes.
Kevin and Gene began working together when they discovered they had a common passion to really understand what differentiated high-performing IT organizations from their more typical counterparts. Visible Ops began to take shape when they started studying a list of organizations that Gene had been keeping for years, which he called "Gene's list of people with amazing kung fu."
After years of research and investigation, Kevin and Gene now refer to this list more formally as "the high-performing IT operations and security organizations with the highest service levels, as measured by mean time to repair (MTTR), mean time between failures (MTBF), and availability; the early integration of security requirements into the operations life cycle; the lowest amount of unplanned work; and the highest server-to-system-administrator ratios." What makes the organizations on this list especially astonishing is that they also have more efficient cost structures than lower-performing organizations.
To coordinate and expand their efforts, Kevin and Gene donated their work to the Information Technology Process Institute (ITPI). The ITPI is a not-for-profit organization engaged in three principal areas of activity: research, benchmarking and the development of prescriptive guidance for practitioners and business executives. The ITPI has collaboration agreements in place with research organizations such as the University of Oregon's decision sciences program and the Software Engineering Institute at Carnegie Mellon University. The ITPI also attracts many other contributors through the ITPI Community of Practice List (ICOPL). Currently, there are hundreds of top practitioners from IT operations, security, audit, management and governance on the ICOPL, representing thousands of years of IT experience.
Through research, development and benchmarking, the ITPI creates powerful measurement tools, prescriptive adoption methods (such as Visible Ops) and control metrics to facilitate management by fact. The end result of these efforts is to assist organizations with their IT process improvement efforts.
Common Characteristics of High-Performing IT Organizations
What makes high-performing organizations so different from average organizations, both qualitatively and quantitatively? We observe that high-performing IT organizations share the following characteristics:
- Server to system administrator ratios greater than 100 to 1: This means that each system administrator controls more than 100 servers. In contrast, organizations not using effective processes see ratios around 15 to 1.
- Low ratio of unplanned to planned work: Only 5% of operational expense goes toward unplanned work. From our ongoing benchmarking, we find that average organizations spend 25% to 45% of their total operational expenses on unplanned, unscheduled work.
- Higher staffing early in the IT life cycle: Continual deployment of resources and staff in the preproduction build phase, where the cost of defect repair is lowest.
- Collaborative working relationships between functions: IT operations and security work together to achieve common objectives, with IT operations performing most of the work and security acting as coach and consultant.
- Posture of compliance: Trusted working relationship between IT operations and auditors, because controls are visible, verifiable and regularly reported on.
- Culture of change management: Ubiquitous understanding throughout the organization that changes must be managed in order to achieve business objectives.
- Culture of causality: Through the use of controls and metrics, these groups identify and solve problems through logical use of cause and effect, instead of a culture of "let's see if this works."
- Management by fact: These organizations value controls and metrics, not only to aid effective problem-solving, but to aid fact-driven decision-making, as opposed to "management by belief" or "management by the honor system."
Figure 1: Server to System Administrator Ratio
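The two quantitative characteristics above, the server to system administrator ratio and the share of operational expense spent on unplanned work, are simple to compute once the underlying numbers are tracked. The following is a minimal sketch, not part of the Visible Ops methodology itself: the `OpsSnapshot` class, its field names and the sample figures are hypothetical, and treating "more than 100 to 1" and "at most 5%" as hard cutoffs is an assumption made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class OpsSnapshot:
    """Hypothetical inputs; field names are illustrative, not ITIL terms."""
    servers: int
    system_admins: int
    unplanned_expense: float  # operational spend on unplanned work
    total_expense: float      # total operational spend

    def server_to_admin_ratio(self) -> float:
        # Number of servers each system administrator looks after.
        return self.servers / self.system_admins

    def unplanned_share(self) -> float:
        # Fraction of operational expense consumed by unplanned work.
        return self.unplanned_expense / self.total_expense


def matches_high_performer_profile(s: OpsSnapshot) -> bool:
    # Assumed cutoffs, read off the characteristics listed above:
    # more than 100 servers per administrator, at most 5% unplanned work.
    return s.server_to_admin_ratio() > 100 and s.unplanned_share() <= 0.05


# Made-up sample data for two organizations.
typical = OpsSnapshot(servers=300, system_admins=20,
                      unplanned_expense=350_000, total_expense=1_000_000)
high = OpsSnapshot(servers=1200, system_admins=10,
                   unplanned_expense=50_000, total_expense=1_000_000)

for name, snap in (("typical", typical), ("high", high)):
    print(name, snap.server_to_admin_ratio(), snap.unplanned_share(),
          matches_high_performer_profile(snap))
```

With these sample figures, the first organization sits at 15 servers per administrator with 35% unplanned work, which is in the range reported for average organizations, while the second clears both assumed cutoffs.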
Why Did We Use ITIL?
To understand what the best-in-class organizations were doing, Gene and Kevin wanted to determine the union and intersection of their IT processes. In other words, what are the common practices of all the high-performing IT operations organizations studied, and which ones are necessary to achieve the high-performing characteristics? Even this line of questioning was a challenge, because each organization had independently developed its own processes, and each had Darwinistically evolved to learn from past mistakes to prevent certain IT disasters from ever happening again. Because they were building their own playbook, as opposed to using an external standard, each organization called similar processes by different names. For example, one organization's "change management" process was another's "work authorization request system" or "change control" process. As a result, Kevin and Gene first needed some way to normalize terminology in order to determine what processes these organizations had in common.
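The normalization step can be pictured as mapping each organization's local process names onto a shared vocabulary and then taking set intersections and unions. The sketch below is only an illustration of that idea, not the procedure Kevin and Gene actually followed: the `CANONICAL` mapping, the organization labels and the names "build and deploy" and "cmdb upkeep" are hypothetical, and only the three change-related names come from the text.

```python
# Hypothetical mapping from local process names to a shared vocabulary.
CANONICAL = {
    "change management": "change management",
    "work authorization request system": "change management",
    "change control": "change management",
    "release management": "release management",
    "build and deploy": "release management",        # invented example name
    "cmdb upkeep": "configuration management",       # invented example name
}


def normalize(local_names):
    """Map one organization's local process names to canonical names."""
    return {CANONICAL[name.lower()] for name in local_names}


# Made-up organizations and the names they use for their processes.
orgs = {
    "org_a": ["Change Management", "Release Management"],
    "org_b": ["Work Authorization Request System", "Build and Deploy",
              "CMDB upkeep"],
    "org_c": ["Change Control", "Release Management"],
}

normalized = {org: normalize(names) for org, names in orgs.items()}

# Intersection: processes every organization has in common.
common = set.intersection(*normalized.values())
# Union: every process seen in at least one organization.
all_seen = set.union(*normalized.values())

print(sorted(common))
print(sorted(all_seen))
```

In this toy data set, change management and release management appear in every organization once the names are normalized, while configuration management appears in only one. Without the shared vocabulary, the intersection of the raw names would be empty.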
To resolve this terminology problem, they ran a Google search for "release management and change management," which brought them to ITIL. ITIL is a compilation of IT best practices, provided without prioritization or any prescriptive structure. It provides a framework and catalog of IT operational processes, distilled from thousands of man-years of experience. Initially created in the late 1980s, the ITIL body of knowledge continues to be enhanced and better organized, most significantly (in our opinion) in the form of BS 15000, which divides all the ITIL disciplines into five key areas: release processes, control processes, resolution processes, relationship processes and service delivery processes.
Figure 2: BS 15000 view of ITIL process areas
BS 15000 categorizes the ITIL capabilities into five areas. Each is briefly described below:
Release processes: This process area answers the question, "Where does infrastructure come from before it is deployed?" This includes activities such as the planning, designing, building and configuring of hardware and software. Unfortunately, release processes are traditionally the last process area that organizations invest in. Yet this is the process area that delivers the highest return on investment, because it encompasses the entire preproduction infrastructure, where the cost of defect repair is lowest.
Control processes: This process area covers maintaining production infrastructure, not only to prevent service interruptions, but also to efficiently deliver IT service. This is done through change management and through asset and configuration management, which BS 15000 defines as primary controls. As Stephen Katz, former CISO of Citibank, once said, "Controls don't slow the business down; like brakes on a car, controls allow you to go faster."