Data science isn't new, but the demand for quality data has exploded recently. This isn't a fad or a rebranding; it's an evolution. Decisions that govern everything from successful presidential campaigns to a one-man startup headquartered at a kitchen table are now based on real, actionable data, not hunches and guesswork.
Because data science is growing so rapidly, we now have a massive ecosystem of useful tools. I've spent the past month or so trying to organize this ecosystem into a coherent portrait and, over the next few days, I'm going to roll it out and explain what I think it all means.
Since data science is so inherently cross-functional, many of these companies and tools are hard to categorize. But at the very highest level, they break down into the three main parts of a data scientist's workflow: getting data, wrangling data and analyzing data. I'll be covering them in that real-world order, starting with getting data, or data sources.
What's the point of doing this?
I spend a ton of time talking to data scientists about how they work, what their challenges are and what makes their jobs easier. There are of course thousands of tools in the data science toolbox, so this ecosystem is by no means exhaustive. But the software and companies I've heard about most often are all included, as well as the open source programs that often drive and inform the tools themselves.
In fact, this is where I think the distinction between a statistician and a data scientist is more than semantic. In my view, statisticians take data and run a regression. Data scientists actually fetch the data, run the regression, communicate the findings, show patterns and lead the way towards actionable, real-world changes in their organization, regardless of what that organization actually does. Since they need to oversee the entire data pipeline, my hope is that this ecosystem shows many of the important tools data scientists use, how they use them and, importantly, how those tools interact.
Let's get started.
The rest of this ecosystem doesn’t exist without the data to run it. Broadly speaking, there are three very different kinds of data sources: databases, applications and third-party data.
Structured databases predate unstructured ones. The structured database market is somewhere around $25 billion and you'll see big names like Oracle in our ecosystem along with a handful of upstarts like MemSQL. Structured databases store data in a fixed set of columns, are generally queried with SQL and are usually used by the sort of business functions where perfection and reliability are of paramount concern, e.g. finance and operations.
One of the key assumptions of most structured databases is that queries run against them must return consistent, perfect results. A good example of who might absolutely need to run on a structured database? Your bank.
They're storing account information, personal markers (like your first and last name), loans their customers have taken out, etc. A bank must always know exactly how much money you have in your account, down to the penny.
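To make the bank example concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the names and the balances are all made up for illustration, and no real bank stores its data this simply. The sketch shows the two properties described above: a fixed set of typed columns, and a transfer that either applies completely or not at all.

```python
import sqlite3

# Hypothetical schema for illustration: a fixed set of typed columns,
# with balances held as integer cents so they stay exact to the penny.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE accounts (
        id INTEGER PRIMARY KEY,
        first_name TEXT NOT NULL,
        last_name TEXT NOT NULL,
        balance_cents INTEGER NOT NULL CHECK (balance_cents >= 0)
    )
    """
)
conn.executemany(
    "INSERT INTO accounts (id, first_name, last_name, balance_cents) VALUES (?, ?, ?, ?)",
    [(1, "Ada", "Lovelace", 10_000), (2, "Alan", "Turing", 2_500)],
)
conn.commit()


def transfer(conn, source_id, target_id, amount_cents):
    """Move money between two accounts atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance_cents = balance_cents + ? WHERE id = ?",
                (amount_cents, target_id),
            )
            # If this debit would overdraw the source account, the CHECK
            # constraint fails and the credit above is rolled back with it.
            conn.execute(
                "UPDATE accounts SET balance_cents = balance_cents - ? WHERE id = ?",
                (amount_cents, source_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return {account id: balance in cents} for every account."""
    return dict(conn.execute("SELECT id, balance_cents FROM accounts ORDER BY id"))
```

Calling transfer(conn, 1, 2, 4_000) moves forty dollars between the two accounts, while an attempt to move more than the source account holds returns False and leaves both balances exactly as they were.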
And then there are unstructured databases. It's no shock that these were pioneered by data scientists, because data scientists look at data differently than an accountant would. Data scientists are less interested in absolute consistency and more interested in flexibility. Because of that, unstructured databases lower the friction for storing and querying lots of data in different ways.
I'd say that a lot of the popularity of unstructured databases was born directly out of Google's success. Google was trying to store the internet in a database. Think of how ambitious and utterly gigantic that task is. MapReduce, a technology that powered this database, was in some ways less powerful than SQL, but it allowed the company to adapt and grow their data stores as they saw fit. It allowed Google to use that database in ways they simply didn't foresee when they were starting out.
For example, Google could query across all websites, asking which sites linked to which other sites, and modify its search results for its customers accordingly. This scalable, flexible querying gave Google a huge competitive advantage, which is why Yahoo and others invested massively in building an open source version of this technology called Hadoop.
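The "which sites link to which" query maps naturally onto the map and reduce steps. Below is a toy, single-machine sketch in Python. It is not Google's implementation or Hadoop's API; the page names and the function names are made up, and a real framework would spread the map and reduce calls across many machines.

```python
from collections import defaultdict

# Hypothetical crawl output: each page mapped to the pages it links to.
pages = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "a.example"],
}


def map_links(source, targets):
    """Map step: emit one (target, source) pair per outbound link."""
    for target in targets:
        yield target, source


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_links(target, sources):
    """Reduce step: collapse the pages linking to one target into a sorted list."""
    return target, sorted(set(sources))


def inbound_links(pages):
    """Run map, shuffle and reduce in sequence over the whole crawl."""
    pairs = (
        pair
        for source, targets in pages.items()
        for pair in map_links(source, targets)
    )
    return dict(
        reduce_links(target, sources) for target, sources in shuffle(pairs).items()
    )
```

Because each map call looks at a single page and each reduce call looks at a single key, the work can be split up freely, which is where the scalability comes from.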
Additionally, unstructured databases often require less server space. There were major Internet companies that just a few years ago would wipe their databases every three months because it was too expensive to store everything. This kind of logic is unthinkable now.
Having all that data allows companies to build everything from frighteningly powerful recommendation engines to world-class translation systems to incredibly efficient inventory management. Unstructured databases generally aren't as infallible as structured databases, but that's worth the tradeoff for many applications, especially in the data science world.
For example, say your unstructured database is running on 1,000 machines and one is down. It's okay if the recommendation engine that's calling out to those machines uses 999 pieces of data instead of 1,000 to suggest you watch a Patrick Swayze movie. The priority for this sort of database is flexibility, scale, and speed. It's okay that it can sometimes be inexact.
One of the better-known examples of a company that creates unstructured database software is Cloudera, which runs on Hadoop. And to show you how much this space is growing, consider this: Seven years ago, I got calls from VCs who assumed the company's market would be ten or fifteen companies globally. A year ago, it raised nearly a billion dollars.
Now that data scientists have become the biggest consumers of data (as opposed to finance and accounting), expect to hear more and more about unstructured databases.
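Here is a rough sketch of that tradeoff in Python. Everything in it is hypothetical: the shard functions stand in for machines, the ShardDown exception stands in for a timeout or network error, and real recommendation engines are far more involved. The point is only that the caller skips the machine that is down and answers with what it has.

```python
from collections import Counter


class ShardDown(Exception):
    """Raised by a hypothetical shard that cannot be reached."""


def make_shard(titles, up=True):
    """Build a stand-in for one machine holding part of the viewing data."""

    def fetch(user_id):
        # user_id is ignored in this toy; a real shard would filter on it.
        if not up:
            raise ShardDown("shard unavailable")
        return titles

    return fetch


def recommend(user_id, shards, top_n=1):
    """Tally suggestions from every reachable shard and skip the rest."""
    votes = Counter()
    answered = 0
    for fetch in shards:
        try:
            votes.update(fetch(user_id))
            answered += 1
        except ShardDown:
            continue  # an inexact answer beats no answer here
    return [title for title, _ in votes.most_common(top_n)], answered


# Four hypothetical shards, one of which is down.
shards = [
    make_shard(["Dirty Dancing", "Ghost"]),
    make_shard(["Dirty Dancing"]),
    make_shard(["Ghost", "Point Break"], up=False),
    make_shard(["Dirty Dancing", "Point Break"]),
]
```

A bank could never treat a missing machine this way, which is exactly the difference between the two kinds of database.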
In the last ten years, storing critical business data in the cloud has changed from unthinkable to common practice. This has been maybe the biggest shift in business IT infrastructure.
I've noted four major examples in the application space in the ecosystem (sales, marketing, product and customer), but these days every single business function has several SaaS applications to choose from.
The trend probably started with Salesforce. They were the first very successful enterprise data application that decided to create and target their software to their end user, not a CIO. Because of this, Salesforce was building and iterating on software directly for sales teams and not to the whims of individual CIOs. They built something that worked great for their users, and in the process showed that enterprise customers would be willing to entrust critical company data to the cloud.
Instead of sales data living in-house in a custom-installed database, it now lives in the cloud where a company, whose entire lifeblood is based on making that data usable and robust, takes care of it. Other companies quickly followed suit. Now, essentially every department of a business has a data application marketed to and made for that department. Marketo stores marketing data, MailChimp runs your email, Optimizely crunches A/B testing data for you, Zendesk lets you know how happy your customers are. The list goes on and on.
Why's that relevant? Now every department of a business has a powerful set of data for data scientists to analyze and use in predictive analysis. The volume of data is great, but now it’s scattered across multiple applications.
Say you wanted to look at a specific customer in your SugarCRM app. Are you trying to see how many support tickets they've written? That’s probably in your Zendesk app. Are you making sure they've paid their most recent bill? That lives in your Xero app. That data all lives in different places, on different sites, in different databases.
As businesses move to the cloud, they collect more data, but it’s scattered across applications and servers all over the world.
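To see why that scattering matters, here is a small Python sketch of the customer lookup described above. The three lists stand in for exports from a CRM, a support desk and a billing app. The field names and records are invented for illustration, and nothing here uses the real SugarCRM, Zendesk or Xero APIs.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical exports from three separate applications. Note that each
# one happens to call the customer's email address something different.
crm_customers = [{"email": "pat@example.com", "name": "Pat Doe"}]
support_tickets = [
    {"requester": "pat@example.com", "subject": "Login issue"},
    {"requester": "pat@example.com", "subject": "Billing question"},
    {"requester": "sam@example.com", "subject": "Feature request"},
]
invoices = [
    {"contact_email": "pat@example.com", "issued": "2015-01-01", "paid": True},
    {"contact_email": "pat@example.com", "issued": "2015-02-01", "paid": False},
]


@dataclass
class CustomerView:
    email: str
    name: str
    ticket_count: int
    latest_invoice_paid: Optional[bool]  # None when there are no invoices


def customer_view(email):
    """Stitch one customer's record together from all three sources."""
    customer = next(c for c in crm_customers if c["email"] == email)
    tickets = [t for t in support_tickets if t["requester"] == email]
    own_invoices = [i for i in invoices if i["contact_email"] == email]
    # ISO-formatted dates sort correctly as plain strings.
    latest = max(own_invoices, key=lambda i: i["issued"], default=None)
    return CustomerView(
        email=email,
        name=customer["name"],
        ticket_count=len(tickets),
        latest_invoice_paid=latest["paid"] if latest else None,
    )
```

Even in this toy version, the join only works because all three sources share an email address. In practice, finding a reliable key across applications is often the hard part.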
Third-party data is much, much older than unstructured databases or data applications. Dun & Bradstreet is, at its heart, a data seller that's been around since 1841. But as the importance of data for every organization grows, this is a space that's going to keep evolving over the coming years.
Broadly, I've broken this part of our ecosystem into four areas: business information, social media data, web scrapers and public data.
Business information is the oldest. I mentioned Dun & Bradstreet above, but business data sellers are vital to nearly any organization dealing with those businesses. Business data answers the critical question for any B2B company: Who should my sales team be talking to? These days, that data has been repurposed for many other applications, from online maps to high-frequency trading. Upstart data sellers like Factual don’t just sell business data, but they do tend to start there because it’s so lucrative.
Social media data is new but growing rapidly. It's a way for marketers to prove their efforts are making a tangible impact, and running sentiment analysis on social data is how smart PR firms take the temperature of their brands and demonstrate their value. Here, you'll find everything from Radian6 to DataSift.
Then there's web scraping. Personally, I think this is going to be a gigantic space.
If we can get to the point where any website is a data source that can be leveraged and analyzed by smart data science teams, there's really no telling what new businesses and technologies are going to be born from that. Right now, some of the players are import.io and kimono, but I think this space is going to explode in the coming years.
Finally, I'd be remiss if I didn't mention public data. I'm not sure President Obama would have gotten elected without the team of data scientists he employed during his 2008 campaign, and I think some of the lessons he learned about the power of data were the reason he spearheaded Data.gov.
A lot of local governments have followed that lead. Amazon Web Services houses some amazing public data (everything from satellite imagery to Enron emails). These are giant datasets that can help power new businesses, train smarter algorithms, and solve real-world problems. The space is growing so fast we even see a company, Enigma.io, that exists for the sole purpose of helping companies use all the public datasets out there.
Open source tools
There has been a massive expansion in the number of open source data stores, especially unstructured data stores, with Cassandra, Redis, Riak, Spark, CouchDB and MongoDB being some of the most popular. This post focuses mostly on companies, but another blog post, "Data Engineering Ecosystem, An Interactive Map," gives a great overview of the most popular open source data storage and extraction tools.
Next up? I’ll be posting about the data wrangling phase of the process later this week. See you then.
This article is published as part of the IDG Contributor Network.