The futuristic marriage of machine translation and speech recognition

Flawless speech recognition is the Holy Grail of translation. If we could talk into our smartphones and have them vocalize a flawless translation into any language of our choosing, be it Pashto or Portuguese, the language barrier would no longer exist. But like the Babel Fish in “The Hitchhiker’s Guide to the Galaxy,” perfect translation is still a thing of fiction.

Today, we’re still working on mastering the component parts of speech-to-speech translation: transcribing spoken words into text (speech-to-text), translating that text from one language to another (text-to-text) and converting the translation back into spoken form (text-to-speech). After years of slow progress, we’re finally making major strides in all three areas, edging ever closer to the mythical Babel Fish.
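The three stages form a simple pipeline, which can be sketched as the composition of three functions. This is an illustrative sketch only; the function names and stub translations are hypothetical placeholders, not any real system’s API.

```python
def speech_to_text(audio_en):
    # Stage 1: transcribe spoken English into text (stub).
    return "hello world"

def text_to_text(text_en):
    # Stage 2: translate English text into the target language,
    # here Portuguese (stub lookup table for illustration).
    return {"hello world": "olá mundo"}[text_en]

def text_to_speech(text_pt):
    # Stage 3: synthesize the translated text back into audio (stub).
    return f"<audio:{text_pt}>"

def translate_speech(audio_en):
    # Speech-to-speech translation is the composition of all three stages.
    return text_to_speech(text_to_text(speech_to_text(audio_en)))

print(translate_speech(b"..."))  # -> <audio:olá mundo>
```

An error at any stage propagates to the next, which is why progress in all three areas matters at once.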

Speech Recognition Goes Mobile

Speech recognition has roots that go back to the days of Thomas Edison’s phonograph. Machine translation, which began to evolve in the 1950s, is newer. The two technologies working effectively together, however, is a much more recent development.

Until the 1990s, machine translation lived on mainframe computers and was limited largely to R&D labs. Personal computers took off with the dot-com boom in the late 1990s, bringing online machine translators like Yahoo! Babelfish to the desks of individual users.

Mobile translation technologies, in turn, helped the government in its campaigns in the Middle East. Faced with a chronic shortage of qualified translators, agencies reached for these newly mobile tools to fill the gap. In the early 2000s, DARPA deployed the Phraselator on the battlefields of Iraq. The Phraselator was a handheld, one-way speech recognition device: a soldier would speak a phrase into the machine in English, and it would play back a pre-recorded translation in Arabic. The device, however, generated accurate translations only about 50 percent of the time.

In 2006, IBM took the idea a step further, equipping the U.S. Joint Forces Command with 35 laptops running the company’s Multilingual Automatic Speech-to-Speech Translator (MASTOR), a two-way program: a soldier could speak into the laptop in one language (English, for example) and it would output a spoken Iraqi Arabic translation a couple of seconds later.

IBM’s technology was an improvement on the Phraselator, and it reflected how automatic speech recognition and machine translation were improving over time. Both technologies are now good enough to gain traction on the consumer side, in the form of Apple’s Siri and the Talk to Me app on Google’s Android, to name a couple. Microsoft’s Chief Research Officer recently presented in China the latest breakthroughs in Microsoft’s speech recognition technology, which reduce the error rate by over 30%.

How Machine Learning is Helping Speech Recognition Evolve

On the government side, federal contractor SAIC is making big strides in speech recognition. SAIC works with Arabic content suppliers to automatically transcribe and translate spoken material into text, essentially a form of closed captioning. Its machine translation technologies are excellent, but where there are gaps, a human specialist steps in to perfect the interpretation. That person listens to the original audio content and feeds a better translation into the machine translator, helping both the translation engine and the automatic speech recognition improve in the process.

That kind of machine learning is what’s really driving the evolution of translation. As human specialists continue to feed high-quality translations into speech recognition technologies and translation memory systems, speech-to-text, text-to-text, and text-to-speech translations will only grow more refined. When a human spots and corrects a mistake in translation, the technology is, in effect, learning from its mistakes. The automatic speech recognizer will have better information to work from when it issues its translations, whether they’re vocalized or put into text. This is especially important for localization efforts — understanding cultural nuances in order to communicate clearly and avoid faux pas that aren’t obvious in print.
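The feedback loop described above can be illustrated with a minimal sketch of a human-in-the-loop translation memory. The class and its behavior are hypothetical, for illustration only, and not a description of SAIC’s actual system.

```python
class TranslationMemory:
    """Stores source->target pairs, preferring human corrections
    over first-pass machine output."""

    def __init__(self):
        self.machine = {}  # raw machine translations
        self.human = {}    # human-corrected translations

    def machine_translate(self, source, target):
        # Record the engine's first-pass output.
        self.machine[source] = target

    def correct(self, source, better_target):
        # A human specialist overrides the machine output; future
        # lookups "learn" from the correction.
        self.human[source] = better_target

    def lookup(self, source):
        # Human corrections always win over machine output.
        return self.human.get(source, self.machine.get(source))


tm = TranslationMemory()
tm.machine_translate("مرحبا", "welcome")  # engine's guess
tm.correct("مرحبا", "hello")              # specialist's fix
print(tm.lookup("مرحبا"))                 # -> hello
```

Real translation memory systems retrain statistical models rather than store literal pairs, but the principle is the same: each human correction improves every future lookup.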

How Quality Scores Fit In

SAIC is a good example of some of the exciting things coming down the line with this kind of machine learning as we move into 2013 and beyond. SAIC’s engine automatically recognizes speech, machine translates language and trains its translation memory to generate higher-quality results through a system of automated quality scores and quality metrics.

This kind of scoring is becoming increasingly important in the government’s intelligence efforts, specifically for real-time intelligence, a form of instant information gathering. Machine translation is king here because human translators take too long to turn things around. The government uses a number of sensors for intelligence gathering, from drones to Web crawlers that look for suspicious language. If a sensor finds something suspicious in another language, the next step is for machine translation to quickly digest the content and get the gist of what’s being said.

Because there’s so much content, humans receive summaries of large amounts of data at once. At this point, the quality metric steps in to score the data, flagging content that might need a human interpreter. The scoring engine could flag keywords, authors and locations of origin, prompting further analysis by human intelligence officials.
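The triage step described above can be sketched as a simple filter: machine-translated items with a low quality score, or containing watch-listed keywords, are routed to a human analyst. The threshold, keywords and field names here are hypothetical, chosen only to illustrate the idea.

```python
WATCHLIST = {"shipment", "meeting point"}  # illustrative keywords
THRESHOLD = 0.7                            # illustrative quality cutoff

def needs_human_review(item):
    # Flag content the engine is unsure about...
    low_confidence = item["quality_score"] < THRESHOLD
    # ...or content containing suspicious terms.
    flagged_terms = any(k in item["translation"].lower() for k in WATCHLIST)
    return low_confidence or flagged_terms

items = [
    {"translation": "Routine weather report", "quality_score": 0.95},
    {"translation": "The shipment arrives tonight", "quality_score": 0.92},
    {"translation": "Garbled low-score output", "quality_score": 0.40},
]

for_review = [i for i in items if needs_human_review(i)]
# The first item passes through; the other two are flagged for a human.
```

The point of the score is economy: humans only see the fraction of the stream that the machine cannot handle confidently on its own.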

A Way to Go Before the Babel Fish

As exciting as these developments are, automatic speech recognition and translation still have a lot to learn. Just take a look at these poorly translated YouTube Christmas songs to get an idea of how far machines have to go.

It is difficult to perfect translation technology because everyone speaks differently, even if they’re speaking the same language. We use different tones, tempos, accents and registers. These things confuse the automatic speech recognition engine, and it takes a lot of quality data to help the engine understand that it’s the same language being spoken.

Moreover, language evolves over time. I don’t use the same phrases today that I did five years ago. It’s hard for machines to keep up with that evolution, and I don’t think they ever will. That’s why 10 to 50 percent of translation will always be left to human experts, even as we make the technology as intelligent as possible. If the Babel Fish ever does come to fruition, human minds will be inside of it.

Disclaimer: Neither Aaron Davis nor Lingotek has any financial interest in any company or product mentioned in this post.
