The risks of a big man-made IT disaster are on the rise

IT services are but one human error away from a spectacular failure, and there's very little evidence to suggest that we've found a way to stop people from making mistakes.

There's scant evidence that process improvements, security training or technology advances are reducing human errors in IT operations. If anything, the risk of technology disasters is growing, despite the industry's best efforts.

Security breaches and IT outages are getting bigger and they're getting worse: The number of people at risk of being affected by each new incident is on the rise because of our growing interconnectedness.

The Root of the Problem

The common point of failure in just about every incident? Human error. People are responsible in some way for most IT disasters. That has led to increased interest in artificial intelligence (A.I.) tools, among other technologies, in hopes of bolstering security and reliability. But new technologies and methodologies bring new risks. As physicist and cosmologist Stephen Hawking recently noted: "The development of full artificial intelligence could spell the end of the human race."

An A.I.-orchestrated destruction of the human race would indeed be the biggest IT failure ever. But given the ongoing and seemingly unstoppable streams of information security failures, that might be a gamble worth taking.

The evidence is stark: In just the past several months, 800,000 records from the U.S. Postal Service were compromised by intruders, a breach of Home Depot's systems put 56 million payment cards at risk, and 76 million names and addresses were stolen from JPMorgan Chase. Oh, and in August, security services provider Hold Security estimated that a Russian criminal gang, the CyberVors, had stolen more than 1.2 billion unique sets of emails and passwords from 420,000 Web and FTP sites.

And once again, even the strongest IT safeguards often do little to prevent a data breach if a person makes a mistake: In its 2014 Cyber Security Intelligence Index, IBM found "human error" to be a contributing factor in 95% of all incidents investigated.

Uptime at Risk

IT outages don't spark as much outcry as data breaches, but they can still be damaging. Data centers may be able to claim that they offer 99.999% uptime (with downtime per year limited to about 5 minutes, 15 seconds), and major providers of cloud-based services tout at least 99.99% availability (meaning downtime won't exceed about 52 minutes, 34 seconds a year), but outages still occur.
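
The downtime allowance behind an availability percentage is simple arithmetic, and it is easy to get wrong by reading decimal minutes as minutes and seconds (5.26 minutes is not 5 minutes, 26 seconds). The sketch below is only an illustration: it assumes a 365-day year, and the function name downtime_per_year is made up for this example rather than taken from any provider's tooling.

```python
# Allowed downtime per year for a given availability percentage.
# Assumption: a 365-day year; a leap year shifts the results slightly.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

def downtime_per_year(availability_pct):
    """Return (whole minutes, leftover seconds) of allowed downtime per year."""
    unavailable_fraction = 1 - availability_pct / 100
    downtime_seconds = round(SECONDS_PER_YEAR * unavailable_fraction)
    # divmod splits total seconds into whole minutes and remaining seconds.
    return divmod(downtime_seconds, 60)

if __name__ == "__main__":
    for pct in (99.999, 99.99):
        minutes, seconds = downtime_per_year(pct)
        print(f"{pct}% uptime allows about {minutes} min {seconds} s of downtime a year")
```

Under those assumptions, "five nines" works out to about 5 minutes, 15 seconds a year, and 99.99% to about 52 minutes, 34 seconds.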

And the aggregate risks from those outages are growing because so many critical IT services are now concentrated among a handful of cloud providers. Small human errors can easily trigger big problems affecting more people.

Last April, for instance, Amazon blamed an outage on a configuration change that had been "executed incorrectly." More recently, Microsoft said a problem with its Azure platform was caused by a system update. And in 2013, there was a Google Gmail outage and a Yahoo Mail outage -- the latter prompting an apology from CEO Marissa Mayer.

The Uptime Institute reports that analysis of 20 years of abnormal incident data from its members shows that human error is to blame for more than 70% of all data center outages. Moreover, those failures are more costly now than they were in the past.

When Kroll Ontrack, a provider of data recovery services, surveyed its customers about data losses, 66% of the respondents cited desktop and server crashes as the top reason for losses, while only 14% said the losses could be attributed to human error. But that latter figure isn't as small as it seems.

Jeff Pederson, manager of data recovery operations at Kroll, noted that 25% to 30% of the revenue his firm earns comes from restoring data lost because of human error.

An Ounce of Prevention

The standard industry response when something goes wrong is to remind users that disaster recovery is a shared responsibility. But there are concrete steps that IT users, vendors and service providers can take to prevent downtime and breaches.

One step is to make sure you follow best practices.

CenturyLink, a global data center provider, recently earned the Uptime Institute's Management and Operations Stamp of Approval for its 57 data centers. The certification program recognizes facilities with rigorous operations management processes.

Drew Leonard, CenturyLink's vice president of colocation product management, said that striving to keep things running well is essential, because an outage can damage a data center's reputation for years.

Vendors are also turning to new security tools that rely on predictive analytics and machine learning to enable users to "try to take action before any demonstrable harm" occurs, said John McClurg, chief security officer at Dell.

The idea is to move to machine analysis of incidents and leave the interpretation to humans, said Kevin Conklin, vice president of marketing and strategy at Prelert, a machine learning firm.

Said Conklin: "Humans are highly unpredictable."

Copyright © 2015 IDG Communications, Inc.