Remember that quote I started part two with? About data scientists wanting better tools for wrangling so they could work on the "sexy stuff"? Well, after covering how data is stored, how it's cleaned and how it's combined from disparate databases, we're finally there. Data applications are where the "sexy stuff" like predictive analysis, data mining and machine learning happen. This is the part where we take all this data and do something really amazing with it.
Broadly, I've broken this column of our ecosystem into two main branches: insights and models. Insights let you learn something from your data, while models let you build something with your data. They're the tools that data scientists use to explain the past and to predict the future.
We'll start with insights.
I've segregated these tools into four major categories, namely statistical tools, business intelligence, data mining and data collaboration. Those first two are large, mature segments with tools that have been around, in some cases, for decades. Data mining and collaboration aren't quite brand new, but they are less mature markets I expect to grow dramatically as more organizations put additional focus and budget on data and data science.
Statistical tools focus on ad hoc analysis and allow data scientists to do powerful things like run regressions and visualize data in a more easily digestible format. It's impossible to talk about statistical tools and not mention Microsoft Excel, a program used by data scientists, analysts and basically everyone else with a computer. Data scientists have done powerful things with Excel, and it was one of their best, original tools, so it has serious staying power despite serious flaws. In fact, in CrowdFlower’s recent survey of data science tools, we found that Excel is still the program data scientists use most.
Still, there are plenty of tools after that old mainstay. The programming language R is extremely popular as a way to analyze data and has a vast library of open-source statistical packages. Tableau is a great program for visualizing data, used by everyone from businesses to academics to journalists. MathWorks makes MATLAB, which is an engineering platform unto itself, allowing users not only to create graphs but also to build and optimize algorithms. SPSS and Stata have been around for decades and make it easy to do complicated analysis on large volumes of data.
Business intelligence tools are essentially statistical tools focused on creating clear dashboards and distilling metrics. You can think of them as tools that translate complicated data into a more readable, more understandable format for less technical people in your organization. Dashboards allow non-data scientists to see the numbers that are important to them upfront and make connections based on their expertise. Gartner pegs this as a $14 billion market, with the old guard like SAP, Oracle and IBM being the largest companies in the space. That said, there are upstarts here as well. Companies like Domo and Chartio connect to all manner of data sources to create attractive, useful dashboards. These are tools created for data scientists to show stakeholders their success in an organization as well as the health of the organization as a whole.
Where those business intelligence tools are more about distilling data into easy-to-absorb dashboards, data mining and exploration software is concerned with robust, data-based insights. This is much more in line with the "sexy stuff" mentioned in the quote above. These companies aren't about just showing data off; they specialize in building something actionable from that data.
Unlike the third-party applications I wrote about in part one, these data mining tools are often open-ended enough to handle a wide array of use cases, from government to finance to business. For example, a company like Palantir can build solutions that do everything from enterprise cybersecurity to syncing law enforcement databases to disease response. These tools integrate and analyze data, and often, once set up by a data scientist, they can provide the tools for anyone in an organization to become a sort of mini-data scientist, capable of digging into data to look for trends they can leverage for their own department's success. Platfora is a good example of this, but there are plenty more we'll see popping up in the coming years.
The last bit of our insights section centers around data collaboration. This is another space that's likely to be more and more important in the future as companies build out larger data science teams. And if open data is going to become the new open source (and I think it has to), tools like Mode Analytics will become even more important. Mode lets data scientists share SQL-based analytics and reports with any member of their (or another) organization. Silk is a really robust visualization tool that allows users to upload data and create a wide array of filterable graphs, maps and charts. RStudio offers tools for data scientists to build lightweight ad hoc apps that can be shared with teams and helps non-data scientists investigate data. The fact that there are companies sprouting up to aid with this level of data collaboration is just further proof that data science isn't just growing; it's pretty much everywhere.
Again, it's hard to draw hard-and-fast lines here. A lot of these tools can be used by non-technical users or create dashboards or aid with visualization. But all of them are based on taking data and learning something with it. Our next section, models, is a bit different. It’s about building.
I need to start this section with a shout-out. Part of the inspiration for this project was Shivon Zilis' superb look at the machine intelligence landscape, and I mention it now because modeling and machine learning overlap rather significantly. Her look is in-depth and fantastic, and if this is a space you're interested in, it's required reading.
Models are concerned with prediction and learning. In other words: either taking a data set and making a call about what's going to happen, or training an algorithm with some labeled data and trying to automatically label more data.
The predictive analytics space encompasses tools that are more focused on doing regressions. These tools focus not simply on aggregating, combining or cleaning data, but on looking back through historical data and trends and making accurate forecasts with that data. For example, you might have a large data set that matches a person's credit score with myriad demographic details. You could then use predictive analysis to judge a certain applicant's creditworthiness based on their differences and similarities to the demographic data in your model. Predictive analysis is done by everyone from political campaign managers choosing where and when they need to place commercials to energy companies trying to plan for peaks and valleys in local power usage.
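To make that idea concrete, here's a minimal sketch of predictive scoring: fit an ordinary least-squares regression to historical records, then use the fitted line to estimate a score for a new applicant. The income figures, scores and the single-feature setup are invented for illustration; real credit models use many features and far more sophisticated techniques.

```python
# Minimal least-squares regression: estimate a credit-style score from one
# feature. Illustrative only -- real predictive models use many features.

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical historical data: annual income (in thousands) vs. credit score.
incomes = [30, 45, 60, 75, 90]
scores = [580, 630, 680, 730, 780]

slope, intercept = fit_line(incomes, scores)

def predict(income):
    """Score estimate for a new applicant, based on the fitted line."""
    return slope * income + intercept

print(round(predict(50), 1))
```

The point isn't the arithmetic; it's that the model generalizes from past applicants to new ones, which is exactly what the commercial tools in this category do at far greater scale.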
There are a whole host of companies that help with predictive analysis and plenty more on the way. Rapid Insights helps its customers build regressions that give insights into data sets. Skytree focuses on analytics on very large data sets. Companies like Numenta are trying to build machines that continuously learn and can spot patterns that are both important and actionable for the organizations running that data. But at their base level, they're about taking data, analyzing it and smartly forecasting events with that information.
Deep learning, on the other hand, is more of a technique than a solution. That said, it has suddenly become a very hot space because it offers the promise of much more accurate models, especially at very high volumes of training data. Deep learning seems to work best on images, and so most of the early companies doing deep learning tend to be focused on that. Facebook, in fact, had some early success training algorithms for facial recognition based on the face itself (as opposed to making assumptions about who might be whom based on overlapping friend circles and other relationships). MetaMind offers a lightweight deep learning platform available to anyone for essentially any application. Dato packages many of the features from the other categories I've covered, such as ETL and visualization, along with its machine learning tools.
Natural language processing tools, commonly referred to as NLPs, try to build algorithms that understand real human language. Machine learning here involves training those algorithms to detect the nuances of text, not just hunt for keywords. This means being able to identify slang, sarcasm, misspellings, emoticons and all the other oddities of real discourse. Building these tools requires incredibly large bodies of data, but NLPs have the potential to remove a lot of the cost and man-hours associated with everything from document processing to transcription to sentiment analysis. And each of those is a giant market in its own right.
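A toy illustration of why keyword hunting falls short: a naive scorer that just counts positive and negative words mislabels negated phrases, which is exactly the kind of nuance NLP systems are trained to capture. The word lists and the one-token negation rule here are invented for illustration; real systems learn such patterns from large labeled corpora.

```python
# Naive keyword sentiment vs. a slightly less naive version that handles
# negation. Both are toy sketches with made-up word lists.

POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}
NEGATORS = {"not", "never", "no"}

def keyword_sentiment(text):
    """Count positive minus negative keywords, ignoring all context."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def negation_aware_sentiment(text):
    """Flip a keyword's polarity when it directly follows a negator."""
    words = text.lower().split()
    score = 0
    for i, w in enumerate(words):
        polarity = (w in POSITIVE) - (w in NEGATIVE)
        if i > 0 and words[i - 1] in NEGATORS:
            polarity = -polarity
        score += polarity
    return score

print(keyword_sentiment("this is not good"))         # wrongly positive
print(negation_aware_sentiment("this is not good"))  # catches the negation
```

Scale that idea up to sarcasm, slang and misspellings across millions of documents and you can see why these tools need so much training data.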
Probably the best-known illustration of NLPs in pop culture was Watson's performance on Jeopardy! That's actually a very instructive example. When you think of how Jeopardy! clues are phrased, with puns and wordplay and subtleties, the fact that Watson could understand those clues (let alone win its match) is an impressive feat. And that was in 2011; the space has grown immensely since. Companies like Attensity build NLP solutions for a wide variety of industries, while Maluuba has a more consumer-facing option that is, in essence, a personal assistant that understands language. Idibon focuses on non-English languages, an important market that is sometimes overlooked. I think we'll see a lot of growth here in the next decade or so, as these tools have the opportunity to truly transform hundreds of industries.
Lastly, let’s talk about machine learning platforms. While most of the tools above are more like managed services, machine learning platforms do something quite different. A tool like Kaggle isn't so much a concrete product as it is a company that farms data out to data scientists and has them compete to create the best algorithm (a bit like the Netflix prize I mentioned in part one). Microsoft's Azure ML and Google's Prediction API fit well here because, like Kaggle, they can handle a wide array of data problems and aren't bucketed into one specific field. Google’s Prediction API offers a black-box learner that tries to model your input data, while Microsoft’s Azure ML gives data scientists a toolkit to put together pieces and build a machine learning workflow.
Probably because this category has the most ongoing research, there is quite a rich collection of open-source modeling and insights tools. R is an essential tool for most data scientists and works both as a programming language and an interactive environment for exploring data. Octave is a free, open-source port of MATLAB that works very well. Julia is becoming increasingly popular for technical computing. Stanford has an NLP library with tools for most standard language-processing tasks. scikit-learn, a machine learning package for Python, is becoming very powerful and has implementations of most standard modeling and machine learning algorithms.
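To give a flavor of what these packages ship ready-made, here's a from-scratch one-nearest-neighbour classifier, one of the standard algorithms you'd otherwise get, optimized and tested, from a package like scikit-learn. The toy 2-D points and labels are invented for illustration.

```python
# One-nearest-neighbour classification from scratch -- one of the standard
# algorithms that open-source packages like scikit-learn provide ready-made.
import math

def nearest_neighbor(train, query):
    """Return the label of the training point closest to `query`.

    `train` is a list of ((x, y), label) pairs.
    """
    def dist(point):
        return math.hypot(point[0] - query[0], point[1] - query[1])

    _, label = min(train, key=lambda pair: dist(pair[0]))
    return label

# Toy 2-D data: two clusters labeled "a" and "b".
train = [((0.0, 0.0), "a"), ((0.2, 0.1), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.8), "b")]

print(nearest_neighbor(train, (0.1, 0.2)))
print(nearest_neighbor(train, (4.9, 5.1)))
```

In practice nobody hand-rolls this; the value of these open-source libraries is exactly that the standard algorithms are already implemented, fast and well tested.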
In the end, data application tools are what make data scientists incredibly valuable to any organization. They're the exact thing that allows a data scientist to make powerful suggestions, uncover hidden trends and provide tangible value. But these tools simply don't work unless you have good data and unless you enrich, blend and clean that data.
Which, in the end, is exactly why I chose to call this an ecosystem and not just a landscape. Data sources and data wrangling need to come into play before you get insights and models. Most of us would rather do mediocre analysis on great data than great analysis on mediocre data. When data is used correctly, a data scientist can do great analysis on great data. And that's when the value of a data scientist becomes immense.
This article is published as part of the IDG Contributor Network.