Unplanned Work Is Silently Killing IT Departments

You can't see it. You can't smell it. But it's deadly, and it may be in your IT organization's basement, silently killing your company. It's called unplanned work, and CIOs and chief information security officers are losing their jobs because of it. This silent killer is so hard to recognize that many IT professionals don't even realize it exists.

Some will challenge, "If unplanned work is so deadly, where are all of the dead bodies?" The answer is that they're everywhere. Once you see them, you'll see that they are one step removed from the root of virtually all IT problems.

It's difficult to overestimate the effect of unplanned work on an IT organization. Here's a back-of-napkin calculation. According to a Forrester Research estimate from 2002, 10% of the U.S. gross domestic product is spent on IT, comprising 50% of corporate capital expenditures. But IT projects are like a free puppy -- the total cost of the puppy is dominated by the "operate/maintain" costs, not the initial acquisition costs. The U.S. GDP in 2004 was approximately $10 trillion; if 10% of that ($1 trillion) is spent on IT, and if we estimate that 50% of that IT spending ($500 billion) is on "operate/maintain" activities, and if at least 35% of that work is unplanned, that's $175 billion. That's a lot of dead bodies. For many companies, the IT controls work for Sarbanes-Oxley Act Section 404, which AMR Research estimates will exceed $6 billion in 2006, is an unplanned activity. More dead bodies.
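For readers who want to check the napkin math or substitute their own percentages, here is a minimal sketch in Python. The constants are the figures quoted above; the function and variable names are illustrative only.

```python
# Back-of-napkin estimate using the figures quoted in the article.
US_GDP = 10e12                 # approximate U.S. GDP in dollars, as stated above
IT_SHARE_OF_GDP = 0.10         # share of GDP spent on IT (Forrester estimate cited above)
OPERATE_MAINTAIN_SHARE = 0.50  # assumed share of IT spending on operate/maintain
UNPLANNED_SHARE = 0.35         # assumed share of operate/maintain work that is unplanned


def unplanned_work_cost(gdp, it_share, operate_share, unplanned_share):
    """Return the dollars spent on unplanned operate/maintain work."""
    return gdp * it_share * operate_share * unplanned_share


estimate = unplanned_work_cost(
    US_GDP, IT_SHARE_OF_GDP, OPERATE_MAINTAIN_SHARE, UNPLANNED_SHARE
)
print(f"Estimated cost of unplanned work: ${estimate / 1e9:,.0f} billion")
```

Changing any one constant shows how sensitive the total is to that assumption.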

Unplanned work is any activity in the IT organization that can't be mapped to an authorized project, procedure or change request. Any service interruption, failed change, emergency change, patch or security incident creates unplanned work.

The amount of unplanned work in your IT organization is a remarkably accurate indicator and predictor of IT effectiveness. In 2002, early in my firm's research on high-performing IT organizations, we developed a 75-question assessment to determine whether or not an organization is high-performing. We look back at this assessment with some embarrassment, because we now believe we can draw conclusions about an organization's maturity and needed prescriptive steps by asking just one question: What percentage of your IT organization's work is unplanned? Those organizations that spend less than 10% of their time on urgent and unplanned work also usually have extremely high levels of operational excellence, compliance and security and have good working relationships with auditors.
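One way to answer that single question is to tally hours from a work log and treat every entry that cannot be mapped to an authorized project, procedure or change request as unplanned. The sketch below assumes a hypothetical log format; the WorkEntry fields, the sample entries and the IDs such as "CHG-1042" are invented for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkEntry:
    description: str
    hours: float
    # Hypothetical field: ID of the authorizing project, procedure or
    # change request, or None if the work cannot be mapped to one.
    authorized_by: Optional[str]


def unplanned_percentage(entries: List[WorkEntry]) -> float:
    """Return the percentage of logged hours that were unplanned."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    unplanned = sum(e.hours for e in entries if e.authorized_by is None)
    return 100.0 * unplanned / total


# Invented sample data, for illustration only.
log = [
    WorkEntry("Deploy billing release", 6.0, "CHG-1042"),
    WorkEntry("Restore service after failed change", 4.0, None),
    WorkEntry("Sales support project work", 10.0, "PRJ-7"),
]
pct = unplanned_percentage(log)
print(f"{pct:.0f}% of logged hours were unplanned")
```

The hard part in practice is the logging discipline, not the arithmetic: the number is only as good as the mapping of work to authorizations.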

While CIOs aspire to focus on strategic issues, they must first master the tactical, because unplanned work comes at the expense of strategic planned work.

A common view is that IT has two business functions:

  1. Build and complete new projects for the business: The ideal IT organization is completing projects on time, with reliable quality, and is delivering needed capabilities to the business. These IT projects are planned work, and anything that detracts from completing them is unplanned work. If developers spend 30% of their time on emergency break/fix issues escalated from IT operations, project commitments suffer, often resulting in late projects.
  2. Operate/maintain existing IT services and assets effectively, efficiently and securely: In the ideal, IT services are performing as advertised and promised, with a reliable level of quality, and customers are satisfied. Controls exist so that IT management detects variance early and can repair it in a planned and orderly manner, and controls exist to foster a culture of compliance, helping IT management achieve business goals and satisfy auditors.

For the purposes of this discussion, let's suppose that there are two extremes: high- and low-performing IT organizations. In my research with the IT Process Institute, we have found that high performers have the lowest amount of unplanned work (less than 5%). Low performers typically have poor service quality, with constant service outages, break/fix work and fire fighting. They also have unhappy customers who seem to see every mistake, auditors constantly bombarding them with more documentation requests, tests and archaeology projects -- and of course, high amounts of unplanned work (often exceeding 50%).

The sources of unplanned work are very different for high and low performers. I asked several experts for the top contributors to unplanned work in lower-performing IT organizations. Their answers were virtually identical:

  • Failed changes: The production environment is used as a test environment, and the customer is the quality assurance team.
  • Unauthorized changes: Engineers do not follow change management process, making mistakes harder to track and fix.
  • No preventive work, making repeated failures inevitable: Mean time to repair may be improving, but without root-cause analysis, the organization is doomed to fix the same problems over and over.
  • Configuration inconsistency: Inconsistencies in user applications, platforms and configurations make appropriate training and configuration mastery difficult.
  • Security-related patching and updating: Inadequate understanding and consistency of configurations makes applying security patches extremely dangerous.
  • Too much access: Too many people have too much access to too many IT assets, causing too many preventable issues and incidents.

On the other hand, for high performers, this group of consultants cites "product and environment failures" as the top cause for unplanned work. "Release failures" and "people mistakes/user errors" fall second and third on their lists, because these causes for unplanned work are much rarer in mature IT organizations. According to one of the experts, "Mature organizations have proper checks and balances to keep these things from happening and catch them when they do. This is the linkage that tests the integration between process, people and technology."

Quantifying the Cost

Now that we recognize how unplanned work detracts from IT operational goals by pulling IT professionals away from the activities that achieve them, can we quantify the costs of unplanned work to justify the return on investment of setting controls to reduce them?

We sometimes hear the question, "How can you justify the cost of implementing IT controls? Show me a business case for us to buy testing servers and the tools to enforce our change management process." It's a fair question, and one that can be addressed with a simple example.

Suppose someone changes an IT asset, but the change fails catastrophically due to lack of preproduction testing and change management authorization. The failed change results in an "all hands on deck" situation for the IT operational staff; IT drops planned work to remedy the results of the change. The service disruption causes an incident that takes four hours to repair and involves 25 IT staffers from all functional roles: application developers, QA workers, database administrators, network and systems administrators, and security staff. Lost IT staff productivity is the first cost of this episode of unplanned work.

Unplanned work also comes at the cost of planned project work. In this case, the application developers and QA staffers are taken from the critical path of an important sales support project, and the project ship date slips one week. In addition, to make up for this project delay, IT has to keep a team of contractors on longer.

The costs continue to mount. While the IT staff works to restore service, external customers call the service desk to find out why they can't access their billing information. Because of the large customer base, thousands of customers call the service center. The excess calls require the service center to activate the overflow call center, which costs tens of thousands of dollars. Revenue is also disrupted because the service center staff can't take orders while processing the customer incidents.

Downtime and IT project-resource costs run in the thousands of dollars; service center costs, lost revenue and the delayed IT project costs are in the tens of thousands. Let's take it one step further. Maybe customers become so unhappy that 2% of them leave. The business now has to spend hundreds or thousands of dollars to recapture each of those customers.
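The scenario's line items can be added up in a few lines of Python. This is a rough sketch, not a costing method: only the 25 staffers, the four-hour repair and the 2% churn come from the scenario. Every dollar amount and the customer count are placeholder assumptions chosen to match the orders of magnitude described above, and should be replaced with your own figures.

```python
def incident_cost(staff, hours, hourly_rate, project_delay_cost,
                  call_center_cost, lost_revenue,
                  customers, churn_rate, recapture_cost):
    """Itemize the cost of one failed change and add a total."""
    costs = {
        "staff_productivity": staff * hours * hourly_rate,
        "project_delay": project_delay_cost,
        "overflow_call_center": call_center_cost,
        "lost_revenue": lost_revenue,
        "customer_recapture": customers * churn_rate * recapture_cost,
    }
    costs["total"] = sum(costs.values())
    return costs


result = incident_cost(
    staff=25, hours=4,          # from the scenario
    hourly_rate=75,             # placeholder assumption
    project_delay_cost=40_000,  # placeholder: one-week slip plus contractors
    call_center_cost=30_000,    # placeholder: "tens of thousands of dollars"
    lost_revenue=20_000,        # placeholder assumption
    customers=50_000,           # placeholder customer base
    churn_rate=0.02,            # from the scenario: 2% of customers leave
    recapture_cost=200,         # placeholder: "hundreds or thousands" each
)
for item, dollars in result.items():
    print(f"{item:>22}: ${dollars:,.0f}")
```

Even with modest placeholder values, the customer-facing items dwarf the staff time that is usually the only cost IT tracks.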

Now that your single rogue change affects customers, costs increase almost exponentially. With unhappy customers, you now have marketing and public relations problems. Your marketing department has to both gain new customers and win back the ones who left -- and winning customers back is more difficult and more expensive than gaining brand-new ones.

(And there is one more extremely high cost of unplanned work. Each one of those late projects, which are getting even further delayed, had some ROI that the business attached to it. So, every moment of unplanned work delaying that project has a quantifiable opportunity cost. IT is suddenly the obstacle that is preventing the completion of the project to help the field sales force increase sales 15%. More dead bodies.)

With any business process that is close to the customer and the business, unplanned work can quickly and easily rack up huge costs. After looking at our scenario, how can you justify not implementing change controls and testing?

Try the following exercise: Look at your top 10 unplanned outages in the past quarter or year and determine which ones were caused by failed changes. Of the failed changes, which ones were untested or unauthorized? Calculate the cost of unplanned work for each of those episodes. If any of those failed changes resulted in disruption similar to our scenario, you have created a business case for IT controls.
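The exercise amounts to two filters and a sum. The sketch below assumes you have already recorded, for each outage, whether a failed change caused it, whether that change was tested and authorized, and an estimated cost. The Outage record and every sample entry are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Outage:
    name: str
    caused_by_failed_change: bool
    tested: bool      # ignored unless a failed change caused the outage
    authorized: bool  # ignored unless a failed change caused the outage
    cost: float       # estimated cost of the resulting unplanned work


def control_gap_outages(outages: List[Outage]) -> List[Outage]:
    """Failed-change outages where the change was untested or unauthorized."""
    return [
        o for o in outages
        if o.caused_by_failed_change and (not o.tested or not o.authorized)
    ]


# Invented sample data, for illustration only.
outages = [
    Outage("Billing portal down", True, False, False, 250_000.0),
    Outage("Disk array failure", False, True, True, 40_000.0),
    Outage("Email routing change", True, True, False, 15_000.0),
    Outage("Database upgrade", True, True, True, 8_000.0),
]
gaps = control_gap_outages(outages)
business_case = sum(o.cost for o in gaps)
for o in gaps:
    print(f"{o.name}: ${o.cost:,.0f}")
print(f"Cost attributable to untested or unauthorized changes: ${business_case:,.0f}")
```

The resulting total is the figure to set against the price of test servers and change management tooling.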

It's easy to see how one failed change can quickly add up to hundreds of thousands of dollars -- and how implementing IT change control processes can easily pay off tenfold.

Gene Kim is chief technology officer at Tripwire Inc. Contact him at genek@tripwire.com.
