In the 2016 Google Founder’s Letter, CEO Sundar Pichai cited Google’s long-term investment in machine learning and AI. “It’s what allows you to use your voice to search for information,” he explained, “to translate the web from one language to another, to filter the spam from your inbox, to search for ‘hugs’ in your photos and actually pull up pictures of people hugging ... to solve many of the problems we encounter in daily life. It’s what has allowed us to build products that get better over time, making them increasingly useful and helpful.”
In addition to using machine learning for its own products, Google has released several applied machine learning services -- for vision, speech, natural language and translation -- and has open-sourced its TensorFlow scalable machine learning package. An additional service based on TensorFlow, the Cloud Machine Learning Platform, is still in a closed alpha test phase. I hope to review the Cloud Machine Learning Platform and TensorFlow later this year.
In this preview, I’ll take a close look at the Google Cloud Vision, Cloud Speech, Cloud Natural Language, and Cloud Translate APIs, and compare them to competitive pretrained services from HPE, IBM, and Microsoft. While Amazon and Databricks also compete in cloud machine learning prediction, they don’t offer pretrained APIs.
All four Google machine learning APIs are managed by the Google Cloud Platform Console, and all have RESTful interfaces; some also have RPC interfaces. There are three authentication options; which one to use depends on the API and the use case.
Google Cloud Natural Language API
Natural language processing is a big part of the “secret sauce” that makes Google Search popular. Ask “where should I visit in China,” and Google Search will parse enough of your intent to show you articles about popular travel destinations in mainland China at the top of your results list.
It will also show you related queries, such as “visit china visa” and “is it safe to visit china.” Note that the natural-language processing has extracted the verb “visit” and the object “China,” and distinguished the country China from the Republic of China (Taiwan) and from bone china crockery. It has used syntax parsing and entity identification to find popular “nearby” queries in its historical database.
Ratings are an important aspect of the Google Play store. Millions of ratings on a scale of one to five are accompanied by reviews -- for example, “Awesome app!” with a five-star rating, or “Very buggy and hard to tell who’s in picture” with a two-star rating of the very same app. Think about these reviews as a great data set to use to train a natural-language-processing neural network for sentiment analysis.
The Cloud Natural Language API, currently in open beta, gives you access to Google’s entity recognition, sentiment analysis, and text annotation (syntax analysis) engines for text. Entity recognition and text annotations are supported in English, Spanish, and Japanese; sentiment analysis is supported only in English. You can embed text in your API call or read a text file from a Google Cloud Storage bucket.
The entity recognition service identifies persons, organizations, locations, and other items mentioned in text. The sentiment analysis service looks at a block of text and decides to what extent it is positive or negative, and it estimates an intensity or magnitude of sentiment. Text annotations analyze parts of speech and provide dependency parse trees for the relationships between words.
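As a rough illustration of the two input options described above (inline text or a file in a Cloud Storage bucket), here is a minimal sketch of how a request body might be assembled in Python. The field names (`type`, `content`, `gcsContentUri`, `language`, `encodingType`) reflect my understanding of the REST interface and may differ between API versions; the helper functions are hypothetical and make no network call.

```python
import json


def build_document(text=None, gcs_uri=None, language=None):
    """Describe the text to analyze: inline content or a Cloud Storage URI.

    Field names are assumptions based on the public REST interface.
    """
    if (text is None) == (gcs_uri is None):
        raise ValueError("provide exactly one of text or gcs_uri")
    document = {"type": "PLAIN_TEXT"}
    if text is not None:
        document["content"] = text
    else:
        document["gcsContentUri"] = gcs_uri
    if language:
        document["language"] = language
    return document


def build_request(document, encoding_type="UTF8"):
    """Wrap the document in a request body for an analysis call."""
    return {"document": document, "encodingType": encoding_type}


if __name__ == "__main__":
    body = build_request(build_document(text="Awesome app!", language="en"))
    print(json.dumps(body, indent=2))
```

The same body shape would serve entity, sentiment, or syntax requests; only the method being called changes.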
As you can see in the figure above, the Entity Recognition service tends to find entities that have Wikipedia articles and returns the URI of the articles. The basic Python code used is shown in the figure below.
Haven OnDemand includes Graph Analysis services (also in preview) trained against English Wikipedia; these are similar to Google entity recognition. IBM Watson Concept Expansion and Concept Insights are similar to Google text annotations and entity recognition. The Microsoft Cortana Entity Linking API is similar to Google Entity Recognition, its Linguistic Analysis API is similar to Google text annotations, and its Text Analytics API includes sentiment analysis, key phrase extraction, and topic detection for English text.
Google Cloud Speech API
“Google, how old is the Brooklyn Bridge?”
Most Android smartphone users and people who do Google searches by voice over Chrome are familiar with that pattern. The Google Cloud Speech API, currently in open beta, exposes the engine behind the voice transcription used in Google Now, Google voice search, and Google Translate to companies that want to voice-enable their own sites and apps.
The Google Cloud Speech API provides speech-to-text conversion; it doesn’t do text-to-speech. It handles some 80 languages and variants, and that selection is heavy on variants, including nine localizations of English from Australia to the United States, 18 localizations of Spanish, and 15 localizations of Arabic.
There is no automatic language detection in the API; you need to set the language code accurately for the speaker (rather than the location) to get good recognition. For example, a South African or Zimbabwean with a strong accent, living in the United States and speaking English, is more likely to get good recognition using the en-ZA language code than the en-US language code. That’s consistent with the experience people have not only with Google voice search, but also with Apple Siri, Microsoft Cortana, and apps using third-party recognition engines such as Nuance NDEV. If you’re writing an app that uses the Cloud Speech API, you’ll probably want to default to the system language code but offer an interface for changing the language code for the app.
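The "default to the system language, but let the user override it" approach could look something like the following sketch. The function name and the `en-US` fallback are my own choices for illustration, not part of any Google library; the system locale is passed in as a string so that the logic stays testable.

```python
def choose_language_code(system_locale, user_override=None, fallback="en-US"):
    """Pick a language code for speech recognition.

    A user-selected code always wins; otherwise a POSIX-style locale such as
    'en_ZA.UTF-8' is converted to 'en-ZA'. The fallback is used when the
    system locale is missing or is the generic 'C'/'POSIX' locale.
    """
    if user_override:
        return user_override
    # Drop any encoding ('.UTF-8') or modifier ('@euro') suffix.
    base = (system_locale or "").split(".")[0].split("@")[0]
    if not base or base in ("C", "POSIX"):
        return fallback
    return base.replace("_", "-")


if __name__ == "__main__":
    # A U.S. system locale, overridden by a speaker who prefers en-ZA.
    print(choose_language_code("en_US.UTF-8"))
    print(choose_language_code("en_US.UTF-8", user_override="en-ZA"))
```

In a real app, the first argument might come from something like `locale.getlocale()[0]`, and the override from a settings screen.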
The Google Cloud Speech API has both synchronous and asynchronous batch APIs for transcribing stored audio and complete utterances, and a streaming API to recognize speech live. It handles long-form audio in batch, along with short utterances, and offers both REST (nonstreaming only) and RPC APIs.
You can embed your audio in your service call or point to a GCP bucket that contains an audio file. In addition to a language code, the recognition configuration that accompanies the audio specifies the audio encoding, the sample rate, the maximum number of alternatives to return, whether a profanity filter should be used, and a speech context.
Cloud Speech takes word hints that expand its already large vocabulary and increase the likelihood of correct recognition of expected words. It also does command recognition. The optional speech context contains a list of up to 50 phrases with as many as 100 characters each. You can use this for voice-controlled games, and you can combine it with the Cloud Natural Language API.
Supported audio encodings include FLAC (recommended), LINEAR16, MULAW, AMR, and AMR_WB. Note that lossy music formats such as AAC and MP3 are not supported because the recognition accuracy suffers from the compression. Only mono audio is supported.
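To make the configuration constraints above concrete, here is a minimal sketch that assembles a recognition configuration and enforces the limits mentioned in the text: the five supported encodings, and a speech context of at most 50 phrases of up to 100 characters each. The dictionary keys are illustrative assumptions rather than confirmed field names, and the function itself is hypothetical.

```python
SUPPORTED_ENCODINGS = {"FLAC", "LINEAR16", "MULAW", "AMR", "AMR_WB"}
MAX_PHRASES = 50         # limit on speech-context phrases, per the text
MAX_PHRASE_CHARS = 100   # limit on characters per phrase, per the text


def build_recognition_config(language_code, encoding, sample_rate,
                             max_alternatives=1, profanity_filter=False,
                             phrases=()):
    """Validate the settings and return a config dict (key names assumed)."""
    if encoding not in SUPPORTED_ENCODINGS:
        raise ValueError(f"unsupported audio encoding: {encoding}")
    phrases = list(phrases)
    if len(phrases) > MAX_PHRASES:
        raise ValueError(f"at most {MAX_PHRASES} phrases are allowed")
    if any(len(p) > MAX_PHRASE_CHARS for p in phrases):
        raise ValueError(f"phrases are limited to {MAX_PHRASE_CHARS} characters")
    config = {
        "languageCode": language_code,
        "encoding": encoding,
        "sampleRate": sample_rate,
        "maxAlternatives": max_alternatives,
        "profanityFilter": profanity_filter,
    }
    if phrases:
        # Word hints for expected commands, e.g. in a voice-controlled game.
        config["speechContext"] = {"phrases": phrases}
    return config


if __name__ == "__main__":
    print(build_recognition_config("en-US", "FLAC", 16000,
                                   phrases=["move left", "move right"]))
```

Checking these limits on the client side means a bad encoding or an oversized phrase list fails fast, before any audio is uploaded.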
As you can imagine, Cloud Speech builds on very large training sets gleaned from its use by Google to service voice search. That implies it has learned a large range of regional variations. For example, the spoken U.S. English language includes many diverse dialects, from a Georgia drawl to New England dropped R's (“Pahk the cah”), to distinctive Lawn Guyland (“Eyoo gawt it”) and Philadelphia (“D’youse want wudder?”) accents. Cloud Speech has also learned to handle noise from, for example, passing cars, and in fact Google recommends that you not try to filter the audio for noise prior to sending it for speech recognition.
HPE Haven OnDemand can recognize 21 languages and variants, including both broadband and telephony quality data sets for the most common languages -- for example, Telephony Latin American Spanish. Haven OnDemand can extract audio from video as well as audio files, but does not support synchronous or live recognition.
IBM Watson can recognize eight languages and variants in broadband quality, as well as six in telephony quality. On the standard plan, using telephony models is twice as expensive as using broadband models. The transcription of incoming audio is continuously sent back to the client with minimal delay, and it is corrected as more speech is heard.
Microsoft Bing Speech Recognition supports 28 languages and variants. Real-time streaming is supported on Android, iOS, and Windows when you use the appropriate client library. If you train a Language Understanding Intelligent Service (LUIS) model, you can also receive structured information about the recognized speech to parse the intent of the speaker and drive further actions by the app.
Google Cloud Translate API
The Google Translate website and app, along with the Google Website Translator gadget, have been popular for years. In the early days, bilingual human translators would often roar with laughter at Google’s attempts at machine translation. Over the years, however, human translators have had the opportunity to correct the mistakes made by the machine translator, and many of the corrections have been incorporated into the translation corpus. As a result, Google’s machine translations have improved considerably, although the quality still varies from one language pair to another.
Google Translate API is a paid enterprise service for translating large amounts of text. It supports 90 languages, which yields thousands of potential language pairs, though not every pair is supported. You can, however, query the API for a list of supported ISO 639-1 language codes in JSON format and a list of supported targets for any given source language.
If you don’t know the source language, you can leave out the source language code and the API will try to recognize it. Language detection costs the same $20 per million characters as language translation.
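For budgeting purposes, the pricing arithmetic is easy to sketch. The example below estimates the charge for a given character count at the $20 per million characters quoted above, and builds query parameters that omit the source language when it is unknown. The parameter names (`q`, `target`, `source`) are my understanding of the REST interface and should be treated as assumptions; both helper functions are hypothetical.

```python
from decimal import Decimal

PRICE_PER_MILLION_CHARS = Decimal("20")  # USD, the rate quoted in the text


def estimate_cost(char_count):
    """Estimated USD charge for translating or detecting char_count characters."""
    cost = Decimal(char_count) * PRICE_PER_MILLION_CHARS / Decimal(1_000_000)
    return cost.quantize(Decimal("0.01"))


def translate_params(text, target, source=None):
    """Build query parameters; leave out 'source' so the API detects it."""
    params = {"q": text, "target": target}
    if source:
        params["source"] = source
    return params


if __name__ == "__main__":
    print(estimate_cost(250_000))                   # 5.00
    print(translate_params("Guten Morgen", "en"))   # no 'source' key
```

At this rate, a 250,000-character document works out to $5.00 per target language.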