Watson -- and friends

Review: 6 machine learning clouds

Amazon, Microsoft, Databricks, Google, HPE, and IBM machine learning toolkits run the gamut in breadth, depth, and ease.

What we call machine learning can take many forms. The purest form offers the analyst a set of data exploration tools, a choice of ML models, robust solution algorithms, and a way to use the solutions for predictions. The Amazon, Microsoft, Databricks, Google, and IBM clouds all offer prediction APIs that give the analyst various amounts of control. HPE Haven OnDemand offers a limited prediction API for binary classification problems.

Not every machine learning problem has to be solved from scratch, however. For some problems, a model trained on a sufficiently large sample is widely applicable. For example, speech-to-text, text-to-speech, text analytics, and face recognition are problems for which "canned" solutions often work. Not surprisingly, a number of machine learning cloud providers offer these capabilities through an API, allowing developers to incorporate them in their applications.

Speech-to-text services, for example, will recognize spoken American English (and some other languages) and transcribe it. But how well a given service works for a given speaker depends on the speaker's dialect and accent and on the extent to which the service was trained on similar dialects and accents. Microsoft Azure, IBM, Google, and Haven OnDemand all have working speech-to-text services.

There are many kinds of machine learning problems. For example, regression problems try to predict a continuous variable (such as sales) from other observations, and classification problems attempt to predict the class into which a given set of observations will fall (say, email spam). Amazon, Microsoft, Databricks, Google, HPE, and IBM provide tools for solving a range of machine learning problems, though some toolkits are much more complete than others.

In this article, I'll briefly discuss these six commercial machine learning solutions, along with links to the five full hands-on reviews that I've already published. Google's announcement of cloud-based machine learning tools and applications in March was, unfortunately, well ahead of the public availability of Google Cloud Machine Learning.

Amazon Machine Learning

Amazon has tried to put machine learning within easy reach of mere mortals. Amazon Machine Learning is intended to work for analysts who understand the business problem being solved, whether or not they understand data science and machine learning algorithms.

In general, you approach Amazon Machine Learning by first cleaning your data and uploading it to S3 in CSV format; then creating, training, and evaluating an ML model; and finally creating batch or real-time predictions. Each step is iterative, as is the whole process. Machine learning is not a simple, static magic bullet, even with the algorithm selection left to Amazon.

Amazon Machine Learning supports three kinds of models -- binary classification, multiclass classification, and regression -- and one algorithm for each type. For optimization, Amazon Machine Learning uses Stochastic Gradient Descent (SGD), which makes multiple sequential passes over the training data and updates the feature weights for each mini-batch of samples to try to minimize the loss function. Loss functions reflect the difference between the actual value and the predicted value. Gradient descent optimization works well only for continuous, differentiable loss functions, such as the logistic and squared loss functions.

For binary classification, Amazon Machine Learning uses logistic regression (logistic loss function plus SGD).

For multiclass classification, Amazon Machine Learning uses multinomial logistic regression (multinomial logistic loss plus SGD).

For regression, Amazon Machine Learning uses linear regression (squared loss function plus SGD).
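
To make the optimization concrete, here is a minimal sketch of mini-batch SGD on the logistic loss, written in plain Python. It illustrates the general technique only and is not Amazon's implementation; the function names, learning rate, batch size, number of passes, and toy data are all assumptions made for the example.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, x):
    """Predicted probability that x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_sgd(rows, labels, passes=50, batch_size=4, lr=0.5, seed=0):
    """Fit weights by mini-batch SGD on the logistic loss (hypothetical helper)."""
    rng = random.Random(seed)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    order = list(range(len(rows)))
    for _ in range(passes):  # multiple sequential passes over the training data
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            grad_w = [0.0] * len(weights)
            grad_b = 0.0
            for i in batch:
                # Gradient of the logistic loss for one sample is (p - y) * x.
                err = predict_proba(weights, bias, rows[i]) - labels[i]
                for j, xj in enumerate(rows[i]):
                    grad_w[j] += err * xj
                grad_b += err
            # One weight update per mini-batch, using the batch-averaged gradient.
            for j in range(len(weights)):
                weights[j] -= lr * grad_w[j] / len(batch)
            bias -= lr * grad_b / len(batch)
    return weights, bias


# Toy data (made up): the label is 1 when the single feature is positive.
X = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic_sgd(X, y)
```

Swapping the logistic loss for the squared loss (and dropping the sigmoid) would turn the same loop into linear regression, which is why one optimizer can serve all three model types.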

After training and evaluating a binary classification model in Amazon Machine Learning, you can choose your own score threshold to achieve your desired error rates. Here we have increased the threshold value from the default of 0.5 so that we can generate a stronger set of leads for marketing and sales purposes.
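
The effect of moving the score threshold can be sketched in a few lines of Python. The scores and labels below are made up for illustration, and the helper names are hypothetical; nothing here calls the Amazon service.

```python
def confusion_counts(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def precision_recall(scores, labels, threshold):
    """Precision and recall at a given score threshold."""
    tp, fp, tn, fn = confusion_counts(scores, labels, threshold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores and true outcomes for ten prospects.
scores = [0.95, 0.90, 0.80, 0.70, 0.65, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

for t in (0.5, 0.75):
    p, r = precision_recall(scores, labels, t)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.75 lifts precision from 4/7 to 1.0 while recall falls from 0.8 to 0.6: fewer leads, but stronger ones.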

Amazon Machine Learning determines the type of machine learning task being solved from the type of the target data. For example, prediction problems with numerical target variables imply regression; prediction problems with non-numeric target variables are binary classification if there are only two target states, and multiclass classification if there are more than two.
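
That rule is simple enough to sketch in Python. The function below follows the rule exactly as stated in the paragraph above; it is a hypothetical helper, not Amazon's code, and the service's actual schema inference may treat some columns differently.

```python
def infer_task_type(target_values):
    """Guess the ML task from the values of the target column (hypothetical helper)."""

    def is_number(value):
        try:
            float(value)
            return True
        except (TypeError, ValueError):
            return False

    # Ignore missing entries before looking at the remaining values.
    values = [v for v in target_values if v is not None and v != ""]
    if not values:
        raise ValueError("target column has no values")
    if all(is_number(v) for v in values):
        return "regression"
    distinct = set(values)
    if len(distinct) < 2:
        raise ValueError("target column has only one state")
    if len(distinct) == 2:
        return "binary classification"
    return "multiclass classification"


print(infer_task_type(["12.5", "3", "40"]))
print(infer_task_type(["spam", "ham", "spam"]))
print(infer_task_type(["red", "green", "blue"]))
```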

Choices of features in Amazon Machine Learning are held in recipes. Once the descriptive statistics have been calculated for a data source, Amazon will create a default recipe, which you can either use or override in your machine learning models on that data.

Once you have a model that meets your evaluation requirements, you can use it to set up a real-time Web service or to generate a batch of predictions. Bear in mind, however, that unlike physical constants, people’s behavior varies over time. You’ll need to check the prediction accuracy metrics coming out of your models periodically and retrain them as needed.
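
A periodic check of this kind might look like the following sketch. The accuracy metric, the 0.05 tolerance, and the sample numbers are all assumptions for illustration; in practice you would use whichever evaluation metric you accepted the model on.

```python
def accuracy(predictions, actuals):
    """Fraction of predictions that match the observed outcomes."""
    if not actuals:
        raise ValueError("need at least one observation")
    correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return correct / len(actuals)


def needs_retraining(baseline_accuracy, recent_accuracy, max_drop=0.05):
    """Flag the model when accuracy has fallen by more than max_drop."""
    return (baseline_accuracy - recent_accuracy) > max_drop


# Made-up numbers: accuracy at evaluation time versus on recent outcomes.
baseline = 0.90
recent = accuracy([1, 0, 1, 1, 0, 1, 0, 0, 1, 1],
                  [1, 0, 0, 1, 0, 0, 0, 1, 1, 1])

if needs_retraining(baseline, recent):
    print(f"accuracy fell from {baseline:.2f} to {recent:.2f}; retrain the model")
```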

Azure Machine Learning

In contrast to Amazon, Microsoft tries to provide a full assortment of algorithms and tools for experienced data scientists. Azure Machine Learning is part of the larger Microsoft Cortana Analytics Suite offering, and it also features a drag-and-drop interface for constructing model training and evaluation data flows from modules.

The Azure Machine Learning Studio contains facilities for importing data sets, training and publishing experimental models, processing data in Jupyter Notebooks, and saving trained models. Machine Learning Studio contains dozens of sample data sets, five data-format conversions, several ways to read and write data, dozens of data transformations, and three options to select features. In Azure Machine Learning proper, you’ll find multiple models for anomaly detection, classification, clustering, and regression; four methods to score models; three strategies to evaluate models; and six processes to train models. You can also use a couple of OpenCV (Open Source Computer Vision) modules, statistical functions, and text analytics.

That’s a lot of stuff, theoretically enough to process any kind of data in any kind of model, as long as you understand the business, the data, and the models. When the canned Azure Machine Learning Studio modules don’t do what you want, you can develop Python or R modules.

You can develop and test Python 2 and Python 3 language modules using Jupyter Notebooks, extended with the Azure Machine Learning Python client library (to work with your data stored in Azure), scikit-learn, matplotlib, and NumPy. Azure Jupyter Notebooks will eventually support R as well. For now, you can use RStudio locally and change the input and output for Azure later if needed, or install RStudio in a Microsoft Data Science VM.

When you create a new experiment in Azure Machine Learning Studio, you can start from scratch or choose from about 70 Microsoft samples, which cover most of the common models. There is additional community content in the Cortana Gallery.

The Azure Machine Learning Studio makes quick work of generating a Web service for publishing a trained model. This simple model comes from a five-step interactive introduction to Azure Machine Learning.

The Cortana Analytics Process (CAP) starts with some planning and setup steps, which are critical unless you are a trained data scientist who's already familiar with the business problem, the data, and Azure Machine Learning, and who has already created the necessary CAP environments for the project. Possible CAP environments include an Azure storage account, a Microsoft Data Science VM, an HDInsight (Hadoop) cluster, and a machine learning workspace with Azure Machine Learning Studio. If the choices confuse you, Microsoft documents why you’d pick each. CAP continues with five processing steps: ingestion, exploratory data analysis and pre-processing, feature creation, model creation, and model deployment and consumption.

Microsoft recently released a set of cognitive services that have "graduated" from Project Oxford to an Azure preview. These are pretrained for speech, text analytics, face recognition, emotion recognition, and similar capabilities, and they complement what you can do by training your own models.

Editor's Choice

Databricks

Databricks is a commercial cloud service based on Apache Spark, an open source cluster computing framework that includes a machine learning library, a cluster manager, Jupyter-like interactive notebooks, dashboards, and scheduled jobs. Databricks (the company) was founded by the people who created Spark, and with Databricks (the service), it's almost effortless to spin up and scale out Spark clusters.

The library, MLlib, includes a wide range of machine learning and statistical algorithms, all tailored for the distributed memory-based Spark architecture. MLlib implements, among others, summary statistics, correlations, sampling, hypothesis testing, classification and regression, collaborative filtering, cluster analysis, dimensionality reduction, feature extraction and transformation functions, and optimization algorithms. In other words, it’s a fairly complete package for experienced data scientists.

This live Databricks notebook, with code in Python, demonstrates one way to analyze a well-known public bike rental data set. In this section of the notebook, we are training the pipeline, using a cross validator to run many Gradient-Boosted Tree regressions.

Databricks is designed to be a scalable, relatively easy-to-use data science platform for people who already know statistics and can do at least a little programming. To use it effectively, you should know some SQL and at least one of Scala, R, or Python. It's even better if you're fluent in your chosen programming language, so you can concentrate on learning Spark when you get your feet wet using a sample Databricks notebook running on a free Databricks Community Edition cluster.

InfoWorld Scorecard

Product | Variety of models (25%) | Ease of development (25%) | Integrations (15%) | Performance (15%) | Additional services (10%) | Value (10%) | Overall Score (100%)
Amazon Machine Learning | 8 | 9 | 9 | 9 | 8 | 9 | 8.7
Azure Machine Learning | 9 | 8 | 9 | 9 | 8 | 9 | 8.7
Databricks with Spark 1.6 | 10 | 9 | 9 | 9 | 8 | 9 | 9.2
HPE Haven OnDemand | 7 | 8 | 8 | 8 | 7 | 8 | 7.5
IBM Watson and Predictive Analytics | 10 | 9 | 9 | 9 | 9 | 8 | 9.2
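
The overall score appears to be a weighted average of the six category scores, using the percentages given with each category. That reading is an assumption, as is the helper name below; the sketch simply applies those weights to two rows of the scorecard.

```python
# Category weights taken from the percentages shown in the scorecard.
WEIGHTS = [0.25, 0.25, 0.15, 0.15, 0.10, 0.10]


def weighted_score(category_scores):
    """Weighted average of the six category scores (assumed formula)."""
    if len(category_scores) != len(WEIGHTS):
        raise ValueError("expected one score per category")
    return sum(w * s for w, s in zip(WEIGHTS, category_scores))


databricks = weighted_score([10, 9, 9, 9, 8, 9])  # published overall: 9.2
amazon = weighted_score([8, 9, 9, 9, 8, 9])       # published overall: 8.7
print(f"{databricks:.2f} {amazon:.2f}")
```

Under that assumption the two rows work out to 9.15 and 8.65, consistent with the published 9.2 and 8.7 once rounded to one decimal place.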