Speech recognition: Glamour and deceit?

Why don't machines understand?

Speech is more than words

Robots will only recognize speech when they understand it


“To sell suckers, one uses deceit and offers glamour,” wrote John Pierce of Bell Labs in 1969 as he discussed speech recognition. Seriously, that was his warning about “mad inventors or untrustworthy engineers” who weren't using a scientific approach.

Glamour led to funding, which led to failure, which led to “A.I. winters” in which funding disappeared. Is innovation still being stifled because of those previous failures, or is there a fear of new ideas?

A.I.’s glamour is exploited for marketing purposes today, but progress in A.I. is lacking because we aren't aiming at the right target. As Pierce said about the right approach based on the scientific method, the work should be an experiment, not just an experience.

Many aspects of A.I. are being pursued by big corporate influencers: Nuance, Apple, IBM, Google, Facebook, Microsoft and others that claim glamorous brainlike breakthroughs. As I wrote previously, we won't see brainlike breakthroughs using computer algorithms because it's too hard for programmers.

I’m certain that we will rapidly create speaking machines once we start using science to do it. Science proves its worth daily in our lives: Without it, the struggle is relentless. Just search the Internet for “epicycles” to see how long the wrong model can confound us. (Because the Earth isn’t at the center of the universe, the geocentric model, with planets moving in circular orbits, was complex and futile.)

The problem of speech recognition

I got into trouble this week. I wished my mother-in-law a happy birthday by text message: “Happy birthday Gwendoline!” The response was chilling: “That’s not how you spell my name.”

With some fast typing, "Gwendoleen" got a revised birthday wish, and I got a typical example of why speech recognition fails today. Words are more than sounds.

The problem of speech recognition is more evident when you hear a foreign language. It’s a stream of unintelligible nonsense. You hear the sounds of the language with little idea where words start and end: the word boundary identification problem. Your own language seems different, as if it were nothing more than communication through sequences of words.
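To make the word boundary problem concrete, here is a minimal sketch in Python. It is a toy, not how any real recognizer works: it uses letters instead of sounds, and the vocabulary and the function name `segmentations` are made up for illustration. It lists every way an unbroken stream can be cut into known words.

```python
# Toy illustration of the word boundary problem.
# Assumption: letters stand in for sounds, and VOCABULARY is a
# made-up word list, not taken from any real system.
VOCABULARY = {"no", "now", "here", "where", "nowhere"}


def segmentations(stream, vocabulary):
    """Return every way to split `stream` into words from `vocabulary`."""
    if not stream:
        return [[]]  # one way to split nothing: no words at all
    results = []
    for end in range(1, len(stream) + 1):
        head = stream[:end]
        if head in vocabulary:
            # Keep this word and split whatever is left of the stream.
            for tail in segmentations(stream[end:], vocabulary):
                results.append([head] + tail)
    return results


for words in segmentations("nowhere", VOCABULARY):
    print(" ".join(words))
```

With this vocabulary the stream "nowhere" splits three ways: "no where", "now here" and "nowhere". Nothing in the stream itself says which one the speaker meant.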

As the early A.I. pioneers found, it is a challenge to convert speech to text, and in my birthday example, like many language experiences, it took feedback to know a spelling correction was required.


The glamour of speech recognition is that, like a human, a machine can act as a personal dictation service. This should save us time and improve our interactions with machines.

“Speech recognition” is a glamorous term because it suggests that language is being understood. That’s what speech is about — sending ideas by voice. A better term would be “sounds to text,” to describe what today's statistical systems are doing.

How did we get here?

In 1971, the DARPA Speech Understanding Research (SUR) program led to good progress with search technology. Subsequent advances with statistical tools resulted in the adoption of a single approach today — matching sounds to words to phrases using statistics.

A 2008 article in The New Yorker reviewed the "probabilistic approach to speech recognition" more than 35 years later. It quoted Dr. David Nahamoo, the chief technology officer for speech at I.B.M.’s Thomas J. Watson Research Center: “Brute-force computing, based on probability algorithms, won out over the rule-based approach.”

The battle was won, but the war for humanlike accuracy was lost.

The article goes further: "A speech recognizer, by learning the relative frequency with which particular words occur ... could be “trained” to make educated guesses. Such a system wouldn’t be able to understand what words mean..."
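The "educated guess" described in that quote can be sketched in a few lines of Python. This is a deliberately tiny illustration under stated assumptions, not the method used by I.B.M. or any product: the training text, the `SOUND_ALIKES` table and the function names are all invented for the example.

```python
from collections import Counter

# Assumption: a made-up scrap of training text, invented for this sketch.
TOY_CORPUS = (
    "happy birthday gwendoline . gwendoline called . "
    "thanks gwendoleen . gwendoline wrote"
).split()

# Assumption: a hypothetical table mapping one heard sound to the
# spellings that could produce it.
SOUND_ALIKES = {"gwen-doh-leen": ["gwendoline", "gwendoleen"]}


def train(corpus):
    """Learn the relative frequency of each word: just count them."""
    return Counter(corpus)


def guess(sound, counts):
    """Pick the candidate spelling seen most often in training."""
    return max(SOUND_ALIKES[sound], key=lambda word: counts[word])


counts = train(TOY_CORPUS)
print(guess("gwen-doh-leen", counts))
```

The sketch prints "gwendoline" every time, because that spelling is the most frequent one in the training text. It has no way of knowing which person is being addressed, which is the limitation the quote points to.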

Why not glamorous?

Speech recognition doesn't recognize like a human. The statistical technology based on "learning the statistics" has fundamental limitations.

Unlike humans, systems are unreliable: (a) because they operate in noisy environments where multiple people speak at once, (b) because higher-level information, like context and meaning, is ignored, (c) because other attributes, such as emphasis and the speed of speech, are ignored, and (d) because training data doesn't generalize: loading corpora creates statistics only on that data.

Peter Norvig, the director of research at Google, covered the state of the art of statistical learning at a 2011 symposium. In it, American linguist Noam Chomsky "derided researchers in machine learning who use purely statistical methods ... but who don't try to understand the meaning ..." and then added: "It interprets success as approximating unanalyzed data."

Dictation software (speech recognition)

Pierce wrote: “… a general phonetic typewriter is simply impossible unless the typewriter has an intelligence and a knowledge of language comparable to those of a native speaker of English.” In other words: dictation software needs language understanding.

According to Nuance’s website, today’s best-selling dictation software is Dragon NaturallySpeaking. As an occasional user, I have seen its limitations, where subtle and not-so-subtle errors are introduced into the dictated text. 

Its claimed 99% accuracy is poor because it generates sequential text without understanding. And accuracy drops (a) with foreign accents and (b) with speech that includes features of conversation like backtracking. Dragon cannot detect an error in meaning, but humans are different: We immediately ask questions when something makes no sense or we ignore it completely.

Worse, error correction by voice alone is not for the faint-hearted, and the method isn't humanlike. Suffice it to say that when you can say, "No, change 'Gwendoline' to end like the word 'keen'," dictation will have come of age.

The problem with machine dictation is that to transcribe speech, you need to understand what is being said. I can't take dictation in Swedish, for example, because I don't understand it. Machines for productivity should be intuitive, not force us to learn how to work with them. The idea that technology forces us to learn complex commands seems foolish in 2015, but that’s what is needed with the best-selling software.

No understanding means compromise

If machines understood us, they would be far more useful. Imagine finding a form on a Web page seeking your personal details and saying: "enter my details, please." Today, that isn't an option.

Dragon's ability to fill out an online form illustrates the problem. A tutorial shows you how it works. You use commands, not conversation, like this:

“Click textfield, choose 6, San Jose,” for example. And instead of saying “enter my birthday” you select the field and then say “zero six slash zero three slash nineteen sixty-eight” if that’s your birthday. It's not intuitive at all.

Commands aren’t language. They simply create a verbal typewriter, and if you can’t remember the commands, you can’t use it. Future machines will use conversation, so that saying "enter my details" fills in your name, address, birthday and other relevant details.

We need to get back to the science and the real target. The compromise to focus on engineering has produced limited results, but we need machine interaction to be much more natural. As Pierce said more than 45 years ago, dictation must start with language understanding “comparable to those of a native speaker.”

This article is published as part of the IDG Contributor Network.
