Sidebar: Future Talk

Research in speech technology started at AT&T Bell Laboratories in 1936 and is still going strong. Today, however, other companies have taken the lead. IBM's Superhuman Speech Recognition Project aims to develop a recognition system that meets or exceeds human performance in real-world conditions, which can be affected by background noise or a person's particular speech characteristics. The system, expected to be completed by 2010, will incorporate visual information, like lip-reading does, as well as more conventional speech-recognition techniques.

Meanwhile, BBN Technologies in Cambridge, Mass., is halfway through a five-year project for the Defense Department. Effective, Affordable, Reusable Speech-to-Text, or EARS, aims to turn radio and TV broadcast speech or telephone speech in multiple languages into text with 90% to 95% accuracy.

The capability to turn audio signals from broadcasts, meetings and the like into indexed, searchable text is in a nascent stage, but it will improve, predicts John Makhoul, chief scientist for speech and language processing at BBN Technologies. "If the goals of the EARS project are achieved, it will make [audio] indexing not something that's nice to have, but a necessity."
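The indexing Makhoul describes boils down to a familiar data structure: once audio has been transcribed to text, an inverted index maps each word to the segments that contain it. A minimal sketch (the segment IDs and transcripts here are hypothetical, not from any real system):

```python
# Hypothetical sketch: after broadcast audio is transcribed to text,
# an inverted index maps each word to the transcript segments
# containing it, making the audio searchable by keyword.
from collections import defaultdict

def build_index(transcripts):
    """Map each lowercase word to the set of segment IDs containing it.

    `transcripts` is a dict of {segment_id: transcript_text}.
    """
    index = defaultdict(set)
    for seg_id, text in transcripts.items():
        for word in text.lower().split():
            index[word.strip(".,!?")].add(seg_id)
    return index

def search(index, query):
    """Return segment IDs whose transcripts contain every query word."""
    words = [w.lower() for w in query.split()]
    if not words:
        return set()
    results = index.get(words[0], set()).copy()
    for w in words[1:]:
        results &= index.get(w, set())
    return results

# Example with made-up broadcast segments:
transcripts = {
    "news-0900": "the president spoke about the economy today",
    "news-1200": "markets reacted to the economy report",
}
idx = build_index(transcripts)
print(search(idx, "economy"))            # both segments match
print(search(idx, "president economy"))  # only news-0900 matches
```

In practice the hard part is the upstream recognition accuracy EARS targets; the index itself is straightforward once reliable transcripts exist.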

As the technology evolves, analysts and researchers say the way the technology is used is likely to change. The following are among their predictions:

  • IVR systems will always take the call first. In most call centers today, some calls are handled by the interactive voice response system and some by human agents. If you're in the IVR and you hit zero to get an operator, you start all over again. In the future, those modes will be better integrated. "Every call should go to a recognizer first. Then the recognizer takes the call to the human agent with a recommended solution," says Mark Plakias, an analyst at San Francisco-based research firm Zelos Group Inc.
  • VoIP will facilitate multimodal applications. Voice over IP will allow voice packets and graphical user interface information to travel together over a network. That will enable "simultaneous, multimodal" speech applications in which the user could, for example, choose to do his I/O on a mobile device using a combination of text and speech, according to David Nahamoo, group manager in human language technology at IBM Research.

  • Voice dictation will take off in Asia. Dictation products still can't match the speed and accuracy of a good human typist. Where the technology has a big advantage over typing is in Chinese, where every character requires several keystrokes. "Most of the companies still in the business look to China as the right market," says Alexander Rudnicky, a senior systems scientist at Carnegie Mellon University in Pittsburgh.

Simply identifying spoken words isn't enough, says Rudnicky. Research in speech technology must address multiple fronts, including recognition, understanding, dialogue management and language generation/speech synthesis. And while recognition accuracy is good, it still needs work: "You can never have too few errors," he says.

Copyright © 2004 IDG Communications, Inc.
