Yesterday was a tough day for corporate IT administrators tied to McAfee. In some cases, they faced a full-blown meltdown of their organization's PCs, as hundreds, in some cases thousands, of Windows XP computers went down after receiving a faulty antivirus update from the security firm.
The whole story's not clear at this point, but there are some things that we know -- and a lot that we don't -- about the latest debacle from a vendor that is supposed to protect, not prang, PCs.
This is our first take on what happened, who was hit and why. If you have time to read this, we're assuming you're not one of those scrambling to bring crippled machines back to life.
What happened? Short answer: McAfee screwed up.
The long answer is more complicated. Wednesday's update -- McAfee pushes daily updates to its corporate customers -- was meant to detect and destroy a relatively minor threat, the "W32/wecorl.a" virus. Instead, it went rogue, wrongly fingered the critical "svchost.exe" file in Windows XP Service Pack 3 (SP3) as malware, and then quarantined it by removing it from its normal location. In some cases, the update actually deleted the file.
Think of the snafu as if the police pinned a crime on a suspect based on flawed DNA testing, only to find out they'd got the wrong guy.
Why did the PCs crash and burn after getting the bad update? Without svchost.exe -- a generic host process for services that run from other Windows DLLs (dynamic link libraries) -- a Windows PC won't boot properly.
When users applied the update, then rebooted, they were toast: The machines crashed and rebooted repeatedly. Most also lost all network capability, and some were unable to "see" USB drives, a major problem since recovery may require the reinstallation of svchost.exe, something that could be done more easily by walking a flash drive from one crippled computer to the next.
What machines were affected? Only PCs running Windows XP Service Pack 3 (SP3), says McAfee.
Other versions of Windows XP, including SP1 and SP2, were not nailed by the update, nor were systems running Windows 2000, Vista, Windows 7, Windows Server 2003 and Windows Server 2008. McAfee also said even older editions -- such as Windows 98 -- were unaffected.
There are, however, scattered reports on the McAfee support forum of Vista machines also going down.
My company runs Windows XP SP3. Why were only some crippled? Good question.
Only machines running VirusScan 8.7 were affected, users reported and McAfee confirmed. If you're running an older version, including the earlier Enterprise 8.5, you were in the clear.
A McAfee manager shed some additional light on why some Windows XP SP3 systems were clobbered, while others kept on running. "I've not seen any reports from customers who had left this setting disabled," said Samantha Price, a manager of McAfee's global threat response team, in a message on the firm's support forum for VirusScan Enterprise.
The setting Price referred to, "Scan Processes on enable," is off by default in most installations of VirusScan 8.7.
But not all. One user told Robert McMillan of the IDG News Service that his installation was on by default. And a McAfee support document urges users to set the feature to off after updating VirusScan 8.7 to Patch 1. There's also a note in the VirusScan 8.7 Patch 3 update's Readme file that says the same thing.
To make matters even more confusing, Mike Davis, the managing director of Centrality, a U.K.-based network design and support firm, said that fresh installs of McAfee's Enterprise Policy Orchestrator (EPO) have the setting off by default, but upgrades do not. "Investigations today [have] shown that if you upgrade to the latest EPO, ...in the majority of scenarios the setting at question is enabled by default," said Davis. "If it's a clean install of EPO to the latest version, then we believe the option to be off by default."
EPO is McAfee's corporate security management platform, and is used to push out signature updates enterprise-wide.
Today, McAfee said it was planning to publish an FAQ of its own that would spell out in more detail which customers were affected, and which were not. A company spokesman didn't set a timetable for the FAQ's appearance, saying only that it would be "soon."
Seems clear what happened. But how did this slip through? McAfee hasn't explained everything, but it did acknowledge that "mistakes happen," in the words of its executive vice president of support, Barry McPherson.
Other than that, the only comment has been from a company spokesman, who said yesterday, "We are investigating how the incorrect detection made it into our DAT files and will take measures to prevent this from reoccurring."
John Pescatore, an analyst with Gartner, said that there was actually a pair of failure points at McAfee. "First, a warning should have appeared that the svchost.exe file was about to be quarantined or deleted, rather than just stick it in quarantine," said Pescatore, who noted that he had no inside information on the foul-up. "And second, their automated QA [quality assurance testing] of signatures failed."
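To make Pescatore's second point concrete, here is a minimal sketch of the kind of automated false-positive check he describes: before a signature ships, it is run against a collection of known-good system files, and the release is blocked if any of them match. Everything in it is an assumption made for illustration. The `Signature` class, the byte-pattern matching, the file names and the file contents are hypothetical, and nothing here reflects how McAfee's DAT files or QA pipeline actually work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signature:
    """Hypothetical signature: a name plus a raw byte pattern.

    Real antivirus signatures are far more elaborate; this is only a stand-in.
    """
    name: str
    pattern: bytes


def false_positives(signature, known_good):
    """Return the names of known-good files that the signature wrongly flags."""
    return [
        file_name
        for file_name, content in known_good.items()
        if signature.pattern in content
    ]


def release_allowed(signature, known_good):
    """Allow release only if no known-good file matches the signature."""
    return not false_positives(signature, known_good)


# Made-up stand-ins for clean copies of critical files from each OS build.
known_good = {
    "xp_sp3/svchost.exe": b"MZ\x90\x00...generic host process...",
    "xp_sp2/svchost.exe": b"MZ\x90\x00...older host build...",
}

# A signature that is too broad and one that is not (both invented).
bad_sig = Signature("example-threat", b"generic host")
good_sig = Signature("example-threat", b"\xde\xad\xbe\xef")

print(false_positives(bad_sig, known_good))
print(release_allowed(good_sig, known_good))
```

A check like this is only as good as its collection of clean files: if no copy of the Windows XP SP3 svchost.exe is in the test set, the faulty signature passes anyway.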
We may get the inside dope at some point: McAfee has promised to make public the results of its investigation. "McAfee is focused on a comprehensive root cause analysis of the issue. We will make this information available publicly as quickly as possible," the company said in a page posted to its site.
Some users can't wait. "When this is all over, I expect McAfee to have some straight answers for me," said CrazyFingers, the user who started the longest-running support thread. "I expect them to explain clearly how a mistake of this magnitude managed to happen in the first place. Then I want to know how it slipped past QA."
How did the problem spread so fast? Pushy updates, says Pescatore.
"Everyone has been on a push for speed," said the analyst, referring to antivirus vendors who have amped up the delivery of signatures in an attempt to match the pace of new malware development by hackers. "They've turned the knob toward faster updates."
Centrality echoed Pescatore. "Typically, customers who use McAfee's Enterprise Policy Orchestrator (EPO) have aggressive update deployment set-ups to ensure the exposure time to true virus threats is minimized," the company said in an after-action report it drafted on the McAfee disaster. "It is because of this standard, aggressive, deployment process that the update was able to get to a large number of machines so quickly."
How do I clean up after McAfee's mess? Manually.
Because virtually all the affected PCs were unable to connect to a network, corporate support personnel must touch each individual machine.
Yesterday, McAfee spelled out the recovery steps for corporate customers and posted separate instructions for consumers. Today, it made available a semi-automated tool, dubbed "SuperDAT Remediation Tool," that may need to be run after entering Windows' Safe Mode. Download the tool to a system that can connect to the network or Internet, then copy it to a flash drive. Walk that drive from one downed PC to the next, running it at each.
How can I fend off any future fiascos? Pescatore had a couple of recommendations.
"Seven, eight years ago, companies routinely tested antivirus updates before pushing them into the organization," he said. "But in the last five years or so, that's been dropped."
But because anti-malware updates come so frequently from vendors -- the need-for-speed syndrome again -- it's unlikely enterprises can adequately test before deploying. Instead, he said it's smart to let others be the guinea pigs. "You don't want to be the first to apply any update, whether a security patch or these," he said.
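One way to act on that advice without testing every update by hand is to hold each new signature file for a set period and release it in stages, so a small pilot group gets it first and everyone else waits. The sketch below shows that idea under stated assumptions: the ring names, the delays and the `rings_due` function are hypothetical, and it is not a feature of EPO or of any particular management console.

```python
from datetime import datetime, timedelta

# Hypothetical deployment rings and how long each waits after a vendor release.
RING_DELAYS = {
    "pilot": timedelta(hours=1),
    "general": timedelta(hours=6),
}


def rings_due(released_at, now, withdrawn=False, ring_delays=RING_DELAYS):
    """Return the rings whose waiting period has elapsed for this update.

    If the vendor has withdrawn the update, or the pilot group has reported
    trouble, the caller passes withdrawn=True and nothing is deployed.
    """
    if withdrawn:
        return []
    age = now - released_at
    return [ring for ring, delay in ring_delays.items() if age >= delay]


# Made-up release time, used only to exercise the function.
released = datetime(2024, 1, 1, 9, 0)

print(rings_due(released, released + timedelta(hours=2)))
print(rings_due(released, released + timedelta(hours=6)))
print(rings_due(released, released + timedelta(hours=6), withdrawn=True))
```

The trade-off is the one Centrality described: every hour an update is held back is an hour of extra exposure to a genuine threat, so the delays are a judgment call for each organization.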
Gregg Keizer covers Microsoft, security issues, Apple, Web browsers and general technology breaking news for Computerworld. Follow Gregg on Twitter at @gkeizer or subscribe to Gregg's RSS feed. His e-mail address is email@example.com.