Speech recognition grows up and goes mobile

Having spread from desktops to mobile devices and beyond, voice recognition is no longer a novelty filling niche needs — and it’s spawning a new genre of gadgets.

For three decades this was speech recognition: You would talk to your computer, typically using a head-mounted microphone and either the unpublicized speech-recognition app in Microsoft Windows or a version of Dragon NaturallySpeaking, from Nuance Communications. If you enunciated carefully, words would appear on the screen or commands would be executed.

Today, much-improved speech recognition is being widely deployed, and in the last two years, it has given birth to a new family of consumer products: voice-controlled personal assistants. “It’s an overnight success that was 30 years in the making,” says Adam Marchick, co-founder of VoiceLabs, which provides analytics for voice app developers. “It has finally gotten precise enough to have conversations.”

Like most things in technology, progress in speech recognition can be quantified. In August 2017, Microsoft announced that the word-recognition accuracy of its conversational speech-recognition system had, on industry-standard tests, exceeded the recognition accuracy of professional human transcribers. The average word error rate for professionals on such tests is 5.9%. The Microsoft system achieved 5.1%.
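The word error rate those tests measure is a standard metric: the number of word substitutions, insertions, and deletions needed to turn the system's output into a reference transcript, divided by the reference's word count. A minimal sketch of the computation (not Microsoft's evaluation code) looks like this:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as a word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j          # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of six: WER = 1/6, roughly 17%
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

On a scale like this, the gap between 5.9% and 5.1% means the software misrecognizes measurably fewer words than a professional transcriber does.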

“It’s like a dream come true,” says Xuedong “X.D.” Huang, a Microsoft technical fellow and head of the company’s Speech and Language Group. “When we started on speech at Microsoft in 1993, the error rate was about 80%. When I started working on speech [in graduate school] in 1982, we were dealing with isolated words and I could not imagine [the software being able to recognize] conversational speech as good as a person.”

“Today, if you speak carefully with a generic accent in a quiet office, you will be getting close to 100% speech-recognition accuracy,” says Vlad Sejnoha, CTO at Nuance.

That level of accuracy means people are going to be talking to their phones more, chatting with robots on customer-service calls with greater ease and effectiveness, and using voice commands to make things happen in their homes and offices.

Cumulative progress

The technology has reached this point through steady slogging, says Sejnoha. “For 15 or 20 years, the primary techniques we used were statistical, especially hidden Markov models,” says Sejnoha. “We had a variety of models that predicted the likelihood that this snippet is something a particular phoneme would generate, or if a particular word could reasonably occur in a particular context. We developed all sorts of variants, and we made steady progress.

“In recent years, the traditional statistical methods have been supplanted by deep learning [neural networking] models, which are very flexible and have propelled the system further than before,” he adds. “The result has been an average 20% yearly reduction in error rates over the last decade.” Speech recognition, he says, is now working out of the box for more people, and in a wider range of environments. “But there’s still shouting at cocktail parties,” says Sejnoha, citing one example of an environment where speech recognition still doesn’t work well.
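A 20% relative reduction compounds quickly: sustained for a decade, it cuts the error rate to roughly a tenth of its starting value. A quick check of the arithmetic, using a hypothetical starting rate:

```python
# Hypothetical starting word error rate; the 0.8 factor is the 20% yearly
# relative reduction Sejnoha cites.
rate = 0.40
for year in range(10):
    rate *= 0.8
print(round(rate, 4))  # 0.40 * 0.8**10, about 0.043: a tenth of the start
```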

Sejnoha expects the 20% annual improvement rate to continue, opening up not only noisy environments but also more and more special cases. “Understanding multiple languages is increasingly important, and with GPS for Europe you have to do things like understand French place names pronounced by German drivers. Mandarin has a lot of borrowed words, and their pronunciation varies from person to person,” he notes.

Tipping point

While those 20% annual improvements were accruing, major vendors began making their own speech-recognition engines using deep learning. Then they began to trust the technology enough to make it the basis of a new consumer product genre, the personal assistant, first as apps (such as Apple’s Siri and Microsoft’s Cortana) and then as stand-alone devices (such as Amazon’s Echo, based on the Alexa service, and Google Home, based on the Google Assistant service).

Voice recognition in such systems takes place in the cloud. The devices pass along voice data after they are alerted to start listening with a command such as “OK Google.”

“The devices are very thin, like Unix terminals. The computer is in the cloud. They listen for their name, and that’s it,” explains Marchick.
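The thin-device loop Marchick describes can be sketched simply: spot the wake word with a small local model, buffer the utterance that follows, and ship it to the cloud for actual recognition. Every name below is a hypothetical stand-in, not part of any real vendor SDK:

```python
END_OF_SPEECH = None  # e.g. a silence marker emitted by the microphone driver

def run_thin_client(audio_frames, detect_wake_word, send_to_cloud):
    """audio_frames: iterable of short audio chunks.
    detect_wake_word: small local model that only spots the trigger phrase.
    send_to_cloud: uploads the buffered utterance, returns the cloud's answer."""
    listening = False
    utterance = []
    for frame in audio_frames:
        if not listening:
            # Until triggered, the device does nothing but wake-word spotting.
            listening = detect_wake_word(frame)
        elif frame is END_OF_SPEECH:
            # The heavy lifting -- full recognition -- happens server-side.
            yield send_to_cloud(utterance)
            listening, utterance = False, []
        else:
            utterance.append(frame)

# Simulated session: text strings stand in for audio frames.
replies = list(run_thin_client(
    ["chatter", "OK Google", "play", "jazz", END_OF_SPEECH],
    detect_wake_word=lambda f: f == "OK Google",
    send_to_cloud=lambda frames: " ".join(frames)))
print(replies)  # ['play jazz']
```

Keeping only the wake-word model on the device is what lets the hardware stay cheap while the recognition quality tracks whatever the cloud service can do.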

“For a long time, speech recognition was focused on computers, but in the last 5 to 10 years the focus has moved to consumer technology,” adds Todd Mozer, CEO of Sensory, a voice and vision technology company. “The first pivotal event was Steve Jobs endorsing speech recognition with the release of Siri. Anything Apple did was the gold seal of consumer electronics. The second pivotal event was when Amazon released Alexa-based products, such as Echo.”

“When we started in this business over a year ago, there was only the Amazon Echo on the market, and there were a few million devices out there,” says Marchick. “Soon there will be seven Echo competitors out there, and 33 million devices are expected to be in use by the end of the year. Voice interactions are skyrocketing. Previously, there were 300 people making voice apps for these devices. Now, a year later, there are 16,000.”

Echo’s competitors, says Marchick, include Google Home, plus the unreleased Apple HomePod; the unreleased Harman/Kardon Invoke, which will run Microsoft Cortana; the Samsung Bixby for Samsung Galaxy smartphones; and at least two Chinese systems.

Spreading the words

Just as important, these vendors typically offer software development kits that let developers harness their speech-recognition engines to build apps with natural-language interfaces.

“The exciting thing in natural language and speech recognition is the development of these toolkits,” says consultant Deborah Dahl at Conversational Technologies. “They try to set them up so that the average developer can create a spoken-language system using online tools. It really lowers the bar so you don’t need to be a natural-language expert to do a customer service application.”

Sherif Mityas, CIO at the Dallas-based TGI Fridays restaurant chain, says his company was able to launch a speech-based interface in about five months using Lex, the toolkit for Amazon Alexa. It works the same for phone users and for Amazon Echo users, the only difference being that phone users are usually traveling and want directions, he adds.

“It’s like building a web page,” says Marchick of the app-making process. “You have a lot of services at your disposal, you write code, you post it, and you test it out.”

“If you spend a few days getting used to the GUI, the process is very easy,” notes Dahl. “The hard part is they don’t help you with the design of your app — if you don’t have a clear idea of the outcome, there will be a lot of rework when you have to go back after seeing that you did not cover all the cases that you need to cover.” For a pizza-ordering app, for example, “You have to think through all the things that you would need to capture from the user: toppings, thickness, size and sauce. You can get bootstrapped in a couple of weeks, but then you’ll have to align to the back end of your ordering system.”
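Dahl's point about covering all the cases up front can be made concrete as a slot list the app must fill before it can place an order. The slot names, prompts, and values below are invented for this sketch and are not actual Lex syntax:

```python
# Illustrative slot schema for a pizza-ordering skill (hypothetical names).
PIZZA_SLOTS = {
    "size":      {"prompt": "What size pizza?",     "values": ["small", "medium", "large"]},
    "thickness": {"prompt": "Thin or thick crust?", "values": ["thin", "thick"]},
    "sauce":     {"prompt": "Which sauce?",         "values": ["tomato", "white", "pesto"]},
    "toppings":  {"prompt": "Any toppings?",        "values": ["cheese", "pepperoni", "mushroom"]},
}

def next_prompt(filled):
    """Return the question for the first unfilled slot, or None once every
    slot is captured and the order can go to the back-end ordering system."""
    for name, slot in PIZZA_SLOTS.items():
        if name not in filled:
            return slot["prompt"]
    return None

print(next_prompt({"size": "large"}))  # Thin or thick crust?
```

A slot the designer forgets here is exactly the kind of case that forces the rework Dahl warns about.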

Mityas says the main hurdle with the TGI Fridays app was getting the menu options simplified. There are 15 side dishes on the menu, and having Alexa list them was cumbersome, but the developers found they could list the three most popular items and let the user prompt for a longer list, he says.

“In real life you will not predict a lot of what the users will say,” Dahl says. “Users are very surprising, so there will be a period of tuning.” Users of that pizza-ordering app “will ask about breadsticks. They will ask that you not undercook it like last time. The system needs to capture that, or fail gracefully.”

To predict what users will say, Next IT, a provider of conversational A.I. systems such as virtual agents for the enterprise, first studies the words that tend to be used in a company’s interactions with the public.

“As a rule of thumb, when we approach a new [business] domain [for a new client], we like to see between 10,000 and 20,000 curated conversations that we can pull data from,” says Next IT President Tracy Malingo. “Those can be phone calls, chat logs, Twitter feeds — we will take any text conversation that involves back-and-forth interaction between the business and the consumer.”

Mityas notes that using speech interactions gives better results than text-based interactions, since the users speak freely and establish context that the A.I. can use. Text interactions are often just isolated questions, he adds.

In the end, Malingo says, it takes about the same amount of time to train a virtual agent as it does to train a human agent. “But once the virtual one is trained, it never quits and it works 24 hours a day, answering hundreds of thousands of questions,” she notes.

The cost of a virtual agent depends on the complexity of the application and on the industry, explains Malingo. But the ratio is usually firm: “If the cost of a live phone call is a dollar, then web text chatting with a live agent is 50 cents, since the agent can do more than one chat at a time. A virtual agent, meanwhile, would be 5 cents,” she says.

Mityas could supply no cost figures for privately owned TGI Fridays, but he says that using speech had tripled the level of online user engagement and that takeout sales doubled in less than a year.


The use of virtual agents does not mean that all the human agents are replaced, says Malingo. What happens is that the “escalation points” (at which the callers must be referred to a live agent) are shifted.

Escalation is key, agrees Ibrahim Khoury, director of technology at Alight Solutions, an employee benefits management firm. By introducing a natural-language agent to handle an annual enrollment event, the company was able to cut escalations to live agents by 94%, Khoury says.

With the virtual agents, “We are trying to address the low-value, high-volume concerns, with customers asking quick questions and getting quick answers,” adds Khoury. “That opens the door for high-value, low-volume questions for real agents to handle, things like, ‘I lost my spouse. What do I do?’

“But tweaking never ends,” says Khoury. “You’re happy if the system can respond to 85% to 90% of the questions. It may hover in the 60% range in the beginning. But there will always be 10% that the system will never understand.”

Interactions with robots typically take less time because there is less chitchat, Malingo notes. “However, it is pleasant, and people do thank the robots almost every time,” she adds.

As for real-world reliability, “When you can constrain the application and, for instance, just talk about pizza, the quality of the speech recognition is amazing,” says Marchick. “But when you get into general conversation, you’re still not at utopia yet, and you would not mistake it for the singularity. If you wanted to turn it on during a meeting and take notes, that’s really hard, as a meeting could be about anything, and trying to summarize a conversation is really hard. If you had one in a hotel room to handle the limited things you would want — music, or room service, or movies — the environment would be constrained enough for it to work well.”

Recognition engines usually return a confidence value between zero and one for each word, and the programmer can decide when to request clarification from the user, Dahl notes. However, there is an art to deciding what is a good confidence level, since borderline cases will cause the user to be barraged with annoying clarification requests.

“Asking the user if they meant ‘United States’ or ‘USA’ might get annoying,” she says.
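The decision Dahl describes amounts to a three-way policy on the engine's confidence score. The threshold values here are hypothetical; tuning them is exactly the "art" she mentions:

```python
def handle_result(transcript, confidence, accept_above=0.85, clarify_above=0.50):
    """Three-way policy on the engine's 0-to-1 confidence score.
    The thresholds are hypothetical and must be tuned per application:
    set clarify_above too low and users face a barrage of clarification prompts."""
    if confidence >= accept_above:
        return ("accept", transcript)
    if confidence >= clarify_above:
        return ("clarify", f"Did you mean: {transcript}?")
    return ("reject", "Sorry, could you say that again?")

print(handle_result("large pepperoni", 0.93))  # ('accept', 'large pepperoni')
print(handle_result("large pepperoni", 0.61))  # ('clarify', 'Did you mean: large pepperoni?')
```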

And, warns Dahl, “There is no end to the additional design considerations: regional accents, children, malicious users, privacy, etc.”

Not a big consideration, however, is the choice of recognition engine. Asked which vendor’s offering is better for which task, Malingo says, “We can’t tell any difference between them.”

The other tipping point

The moment when speech recognition proved to be good enough to be taken for granted might be pegged at April 12, 2017, when Burger King aired a TV advertisement that attempted to hijack any Google Home device that happened to be listening.

In the commercial, the narrator announced, “You’re watching a 15-second Burger King ad, which is unfortunately not enough time to explain all the fresh ingredients in the Whopper sandwich. But I’ve got an idea. OK Google, what is the Whopper burger?”

Any Google Home device that overheard it replied by reciting material taken from the Wikipedia page devoted to the Whopper burger.

A Google spokeswoman (who declined to be named) says Google blocked the response by the end of the day. “Our main goal is that Google Home helps when you want it to, and not when you don’t,” she says.

Meanwhile, if you want to compose text on a desktop using speech recognition, Windows Speech Recognition and Dragon NaturallySpeaking are still available, notes speech-recognition consultant Bill Meisel. “It’s a lawyer and doctor specialty area — but if you want to dictate notes on a mobile phone, Cortana will let you,” he adds.

As for where it will all end up, Huang notes, “The PC democratized computing, and mobile computing democratized the PC. The next shift will be ambient computing, where you can free yourself from being tethered to the device. Speech recognition will be the core of that shift.”

Mityas agrees. “No one will be using apps in 10 years. They will be talking to devices,” he says. “The days of using thumbs are short-lived.”

Copyright © 2017 IDG Communications, Inc.
