Microsoft takes steps to prevent another WGA meltdown

But it now admits the August incident was an 'outage'

Three months after a major failure of Microsoft's anticounterfeit system fingered legitimate Windows XP and Vista users as pirates, a senior project manager has spelled out the steps his team has taken to prevent a repeat.

Alex Kochis, the senior project manager for Windows Genuine Advantage (WGA), used a company blog to outline new processes that have been put in place, including drills that test the WGA group's response to an outage like the one in late August.

"We've revamped the monitoring that is used to track what's happening within our server infrastructure so that we can identify potential problems faster, ideally before any customer gets impacted," Kochis said. "[And] since August, we have conducted more than a dozen 'fire-drills' designed to improve our ability to respond to issues affecting customers or that could impact the quality of the service."

Those drills, Kochis said, have ranged from pre-announced simulations to surprise alerts that test a specific scenario. "The team is now better prepared overall to take the right action and take it quickly," he promised.

In late August, servers operating the WGA validation system went dark for about 19 hours. Customers who tried to validate their copy of Windows -- a Microsoft requirement for both XP and Vista -- during the blackout were pegged as pirates; Vista owners found parts of the operating system had been disabled, including its Aero graphical interface.

Several days after the weekend meltdown, Microsoft blamed preproduction code for the snafu and said that a rollback to earlier versions of the server software didn't immediately fix the problem, as the team had expected it would.

Microsoft downplayed the incident, claiming that fewer than 12,000 PCs had been affected. The company's support forums, however, hinted that the problem was much more widespread: One message thread had collected more than 450 messages within two days and had been viewed by 45,000 people.

One analyst gave Kochis' status report a mixed grade.

"I was looking for two things from Microsoft, and the first was that they would acknowledge that there was a failure," said Michael Cherry, an analyst at Kirkland, Wash.-based Directions on Microsoft. "If they couldn't do that, it would show a real lack of insight into the severity of the problem. But they called it an 'outage' [here], which I don't think they had actually admitted before."

Cherry was more than on the mark. While Kochis called the incident a "temporary service outage" in his newest post, he denied three months ago that the word applied. "It's important to clarify that this event was not an outage," he said on Aug. 29, five days after the servers went down.

"Second," said Cherry, "I wondered if Microsoft would acknowledge that failures are going to happen, that something's going to go wrong no matter how many drills they have. And when that happens, what would they do? But I don't see anything like that here."

Kochis said the WGA team has also changed the way it updates the validation service's servers, beefed up free WGA phone support to round-the-clock coverage and improved the speed of delivery of "get-legal" kits to users who discover they're running counterfeit software. But he made no mention of any modifications to the antipiracy program itself, how it's implemented or how users are handled when it determines they're using fake copies of Windows.

"They should make it so that any impact [of an outage] is on Microsoft and not on the customer," Cherry said.

Back in August, Kochis claimed that Microsoft's policy was to do just that -- err on the side of the customer -- but he contended that the outage had been an anomaly. "Our system is designed to default to genuine if the service is disrupted or unavailable," Kochis said then. "If our servers are down, your system will pass validation every time. [But] this event was not the same as an outage, because in this case the trusted source of validations itself responded incorrectly."

That's not good enough, according to Cherry. "If users can't validate, for whatever reason, Microsoft should leave them in their current state, but not invalidate them, or validate them, at least until the next check," he said.

"You have to take the utmost care before you deny something to someone that they have purchased in good faith," he concluded.
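The policy Cherry describes, leaving a machine in its current state whenever validation cannot be completed, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Microsoft's implementation: the names Status, ServiceUnavailable, next_status and check are all hypothetical. It also covers only failures the client can detect; a service that confidently returns a wrong answer, as Kochis said happened in August, would pass straight through.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical validation states for one PC."""
    GENUINE = "genuine"
    NON_GENUINE = "non-genuine"


class ServiceUnavailable(Exception):
    """Hypothetical error: the validation service gave no trustworthy answer."""


def next_status(current, check):
    """Return the status to record after one validation attempt.

    `check` is an assumed callable that returns a Status, or raises
    ServiceUnavailable when the service cannot be reached or trusted.
    """
    try:
        return check()
    except ServiceUnavailable:
        # Neither validate nor invalidate: keep the current state
        # until the next scheduled check.
        return current
```

With this rule, a failed check never changes what the customer sees; only a completed check can move a machine from one state to the other.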

Copyright © 2007 IDG Communications, Inc.
