Is speech recognition finally good enough?

Better hardware and algorithms nudge the technology closer to its 10-year promise of supplanting keyboards

"For me it's a lifesaver," said Paul Langer, an attorney at the Chicago office of Mayer, Brown, Rowe & Maw. "I never learned to type."

His alternative to the keyboard is speech recognition (SR) software, in this case Dragon NaturallySpeaking (DNS) from Nuance Communications Inc. in Burlington, Mass. The introduction of DNS a decade ago marked the birth of continuous speech recognition -- previous SR software required the user to pause between words. The product is now in Version 9.0.

Review and video of DNS in action: Read about Lamont Wood's personal experience with Dragon NaturallySpeaking and see a video of how it actually works, with on-screen input, mistakes and corrections.

But problems with accuracy, and the need for an hour-long "enrollment" process to train the software to follow the user's voice, meant that typing didn't become obsolete in the intervening decade. However, things have changed.

"I don't know how accurate it is, but if it were not accurate enough, I would go back to typing," said Peter Laipson, a DNS 9.0 user who, unlike Langer, is also a fast typist.

"I use it to do nearly all my grading," continued Laipson, a history teacher from Arlington, Mass., working at a temporary job in San Francisco. "I will dictate comments on relevant parts of an essay and then summary comments at the end. With Dragon I need about 60% as much time to comment on a paper."

He does not claim it's 100% accurate, saying it is not suitable for text with a lot of slang and recalling a time when it rendered "I really admire your analysis" as "I really admire urinalysis."

"It helps to have a sense of humor, but simple proofreading is enough," Laipson said.

Actually, retaining a sense of humor has been important for multiple reasons in the SR field. In 1993 two executives from Kurzweil Applied Intelligence (which pioneered SR for the medical market) went to prison for faking sales. That firm was sold in 1997 to a Belgian SR firm, Lernout & Hauspie (L&H), which was reporting phenomenal sales growth at the time. Dragon Systems, which originated DNS that year, was reporting only anemic growth, and L&H had no trouble acquiring Dragon Systems in early 2000 in a stock deal. Within a year a series of accounting frauds came to light and L&H collapsed into bankruptcy. Its SR technology was sold in late 2001 to ScanSoft Inc., which kept the DNS line going. (It was then at Version 6.0.) ScanSoft later acquired Nuance and adopted its name.

Starting to deliver

"It was with the launch of Version 8.0 (in November 2004) that the market became reinvigorated and took off," said Chris Strammiello, director of product management at Nuance. "We crossed an invisible line with Version 8.0, where the software actually delivered on its promises and offered real utility for the users. Sales have been growing at a rate of 30% yearly since then, except that we expect it to do better than 30% this year.

"About 60% of the buyers are consumers or what we call proficient professionals," added Strammiello. "The rest are from vertical markets, especially health care and law, whose practitioners are used to dictating and in the past have paid for transcription services. About 10% are people who use speech recognition for accessibility reasons, and that cuts across the other segments."

Version 8.0 reduced the error rate by 30% compared with Version 7.0, while Version 9.0 reduced errors by another 20%, said Strammiello. Overall, about 25% of the accuracy improvements can be credited to today's faster hardware, while the rest stems from improved algorithms, he added. The personal version costs about $200, the professional version costs about $765, and there are also specialized medical and law office versions. Version 9.0 also includes tools for deploying the software over the network.
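
For readers who want to see how those two version-to-version reductions combine, here is a minimal Python sketch. It assumes that each percentage is measured relative to the preceding version's error rate, which the figures quoted above do not spell out; the function name is purely illustrative.

```python
def remaining_error_fraction(reductions):
    """Return the fraction of the baseline error rate that is left after
    applying each relative reduction in turn (0.30 means a 30% cut)."""
    remaining = 1.0
    for reduction in reductions:
        remaining *= 1.0 - reduction
    return remaining


# Version 7.0 -> 8.0 (30% fewer errors), then 8.0 -> 9.0 (another 20%).
# Assumption: the 20% is relative to Version 8.0, not to Version 7.0.
remaining = remaining_error_fraction([0.30, 0.20])
cumulative_cut = 1.0 - remaining

print(f"Errors remaining vs. Version 7.0: {remaining:.0%}")
print(f"Cumulative reduction: {cumulative_cut:.0%}")
```

Under that assumption, Version 9.0 would make roughly 56% as many errors as Version 7.0, a cumulative cut of about 44% rather than the 50% that simple addition would suggest.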

Today, "A person can get 95% accuracy right out of the box, and enrollment is optional and only takes five minutes," said Howard Parks, president of Microref Systems Inc., a firm in Highland Park, Ill., that sells SR systems and trains users.

As for input speed, Strammiello said that DNS can keep up with someone talking 160 words per minute, which is about as fast as most people converse. As for typing speed, Rich Stroud, spokesman for the International Association of Administrative Professionals (IAAP), said that ads for clerical jobs usually ask for at least 40 words per minute.

But despite the obvious speed advantages, there has been no evident rush to adopt the technology. For instance, Stroud noted that only 5% of the IAAP's membership reports using SR software at work. When asked what software they wish the boss would supply, none mentioned SR.

95% isn't good enough

That resistance arises at least partly because 95% accuracy is not good enough, said Parks, speaking from his experience. "Most users are not happy until they get to the 98% level," he said. "It's only when you become skillful that you can say that it becomes productive, and it takes five to 20 hours of intensive use to become skillful."
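
A rough Python sketch suggests why the gap between 95% and 98% looms so large for users. It assumes that "accuracy" means the share of dictated words rendered correctly and that the user dictates at the 160 words per minute cited earlier; both are simplifying assumptions, and the function name is illustrative.

```python
def errors_per_minute(words_per_minute, accuracy):
    """Expected number of misrecognized words per minute of dictation,
    assuming accuracy is a per-word rate (a simplifying assumption)."""
    return words_per_minute * (1.0 - accuracy)


DICTATION_WPM = 160  # speaking rate quoted in the article

# Compare the out-of-the-box level with the level Parks says users want.
results = {acc: errors_per_minute(DICTATION_WPM, acc) for acc in (0.95, 0.98)}

for accuracy, errors in results.items():
    print(f"{accuracy:.0%} accuracy: about {errors:.1f} errors per minute")
```

On those assumptions, moving from 95% to 98% cuts the corrections a user faces from about eight per minute of dictation to about three.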

Not being accustomed to dictation is a problem, he noted, but the main pitfalls involve the need for clear, consistent pronunciation, plus a mastery of the correction procedures by which the software learns from its mistakes, steadily increasing its accuracy rate.

Without help on those issues, about three-fourths of the people who attempt to use SR eventually put it aside and go back to keyboarding, Parks said, and even among those with training the rate is about 20%.

Trying a brand of large-vocabulary desktop SR other than DNS is rarely mentioned as an option because there are few alternatives. After the L&H debacle there were basically three entries: DNS, ViaVoice from IBM and software from Philips that was not actively marketed in the U.S., explained Parks. IBM later sold the marketing rights for ViaVoice to Nuance, which uses it as an entry-level product, he noted.

On the other hand, the most widely owned form of SR is probably a version developed by Microsoft, since it is included free in Office XP -- a fact that appears to be unknown to most Office XP users.

"Office XP had it but Microsoft did not promote it -- it was a beta test and they were not comfortable about the quality of the user interface," said Bill Meisel, head of TMA Associates, a speech industry consulting firm in Tarzana, Calif. Unlike DNS, Office XP's SR required that the user rely on the mouse to navigate and make corrections. (Microsoft did not respond to requests for comment.)

Microsoft's Windows Vista also has SR built in, but it uses a voice correction interface similar to the one in DNS, Meisel noted.

Vista's version not up to speed

"It's good, but it's not in the same league as DNS yet," said Parks, who has used both Vista and DNS SR. "But it's a foundation they can improve on over the years. Dragon's research and development is measured in hundreds of person-years, so it will take a few years for Microsoft to catch up."

At Nuance, Strammiello said he saw Vista as more of a promotional vehicle than as a competitor. "It will expose people to what the technology can do, and those who like it will then seek out a premium product," he predicted.

But will that lead to the day when people set aside their keyboards for SR, having discovered that with a few minutes' training they can achieve several times the throughput that they could reach after investing a semester in a typing class?

"If you had asked me 10 years ago (when DNS came out) I would've said that in three or four years the world would be converted," said Parks. "But here we are 10 years later and I still don't know when it will take off. It is expensive, but there is no question that it offers far greater benefits than anything else. My average user creates text 50 to 100% faster than they did before."

"It's misleading to think of SR as a replacement for the keyboard for the average person," cautioned Meisel. "Where the keyboard is really effective is with editing. Getting around is harder with voice -- you can do it but it requires a new learning experience."

Strammiello preferred to talk about the future of the product itself. "We will be broadening the bell curve and getting more and more users to 99% accuracy," he said. "Speaker independence is on the horizon, but how soon that will arrive is unclear. What we can expect in the meantime are more natural commands and a more conversational interface, and more noise robustness so we can speak in a crowded room."


Lamont Wood is a freelance writer in San Antonio.

Copyright © 2007 IDG Communications, Inc.