It's no secret that data volumes are growing exponentially. What's a bit more mysterious is figuring out how to unlock the value of all of that data. A big part of the problem is that traditional databases weren't designed for big data-scale volumes, nor were they designed to incorporate different types of data (structured and unstructured) from different apps.
Lately, Apache Hadoop, an open-source framework that enables the processing of large data sets in a distributed environment, has become almost synonymous with big data. With Hadoop, end users can run applications on systems composed of thousands of nodes that pull in thousands of terabytes of data.
According to Gartner estimates, the current Hadoop ecosystem market is worth roughly $77 million. The research firm expects that figure to balloon to $813 million by 2016.
Here are 10 startups hoping to grab a piece of that nearly $1 billion pie. These startups were chosen and ranked based on a combination of funding, named customers, competitive positioning, the track records of their executives, and their ability to articulate a real-world problem and explain why their solution is an ideal one to solve it.
(Please note that this lineup favors newer startups. As a result, some big, well-funded names have been left off, such as Cloudera, Datameer, DataStax, and MapR Technologies, simply because they've been around longer than most in this new market sector.)
Platfora

What They Do: Provide a big data analytics solution that transforms raw data in Hadoop into interactive, in-memory business intelligence.
Headquarters: San Mateo, Calif.
CEO: Ben Werther, who formerly served as vice president of products at DataStax.
Funding: $65 million to date. The latest round ($38 million Series C) was locked down in March. Tenaya Capital led the round, while Citi Ventures, Cisco, Allegis Capital, Andreessen Horowitz, Battery Ventures, Sutter Hill Ventures, and In-Q-Tel all participated.
Why They're on This List: As with many startups on this list, Platfora was founded in order to simplify Hadoop. While businesses have been rapidly adopting Apache Hadoop as a scalable and inexpensive way to store massive amounts of data, they struggle to extract meaningful value from that data. The Platfora solution masks the complexity of Hadoop, making it easier for business analysts to leverage their organization's many sources of data.
Platfora tries to simplify the data collection and analysis process, automatically transforming raw data in Hadoop into interactive, in-memory business intelligence, with no ETL or data warehousing required. Platfora provides an exploratory BI and analytics platform designed for business analysts. Platfora gives business analysts visual, self-service analytical tools that help them navigate from events, actions, and behaviors to business facts.
Customers include Comcast, Disney, Edmunds.com, and the Washington Post.
Competitive Landscape: Platfora competes with the likes of Datameer, Tableau, IBM, SAP, SAS, Alpine Data, and Rapid-I.
Key Differentiator: Platfora claims to have the first scale-out, in-memory Big Data analytics platform for Hadoop. Simplifying Hadoop and Big Data analysis has become a more common goal of late, but Platfora is an early mover in this respect.
Alpine Data

What They Do: Provide a Hadoop-based data analysis platform.
Headquarters: San Francisco, Calif.
CEO: Joe Otto, formerly senior vice president of sales and service at Greenplum.
Funding: $23.5 million in total funding, including $16 million in Series B funding, from Sierra Ventures, Mission Ventures, UMC Capital, and Robert Bosch Venture Capital.
Why They're on This List: Most executives and managers don't have the time or skills to code in order to glean data insights, nor do they have the time to learn about complex new infrastructures like Hadoop. Rather, they want to see the big picture. The trouble is that complex advanced analytics and machine learning typically require scripting and coding expertise, which can restrict those capabilities to data scientists. Alpine Data mitigates this issue by making predictive analytics accessible via SaaS.
Alpine Data provides a visual drag-and-drop approach that allows data analysts (or any designated user) throughout an organization to work with large data sets, develop and refine models, and collaborate at scale without having to code. Data is analyzed in the live environment, without migrating or sampling, via a Web app that can be locally hosted.
Alpine Data leverages the parallel processing power of Hadoop and MPP databases and implements data mining algorithms in MapReduce and SQL. Users interact with their data directly where it already sits. Then, they can design analytics workflows without worrying about data movement. All this is done in a Web browser, and Alpine Data then translates these visual workflows into a sequence of in-database or MapReduce tasks.
Customers include Sony, Havas Media, Scala, Visa, Xactly, NBC, Avast, BlackBerry, and Morgan Stanley.
Competitive Landscape: Alpine will compete both with large incumbents (SAS, IBM, SPSS, and SAP) and such startups as Nuevora, Platfora, Skytree, Revolution Analytics, and Rapid-I.
Key Differentiator: Alpine Data Labs argues that most competing solutions are either desktop-based or point solutions without any collaborative capability. In contrast, Alpine Data has a "SharePoint-like" feel. On top of collaboration and search, it also provides modeling and machine learning under the same roof. Alpine is also part of the no-data-movement camp: regardless of whether a company's data sits in Hadoop or an MPP database, Alpine sends out instructions, via its In-Cluster Analytics, without ever moving the data.
Altiscale

What They Do: Provide Hadoop-as-a-Service (HaaS).
Headquarters: Palo Alto, Calif.
CEO: Raymie Stata, who was previously CTO of Yahoo.
Founded: March 2012
Funding: Altiscale is backed by $12 million in Series A funding from General Catalyst and Sequoia Capital, along with investments from individual backers.
Why They're on This List: Hadoop has become almost synonymous with Big Data, yet the number of Hadoop experts available in the wild cannot hope to keep up with demand. Thus, the market for HaaS should rise in step with big data. In fact, according to TechNavio, the HaaS market will top $19 billion by 2016.
Altiscale's service is intended to abstract the complexity of Hadoop. Altiscale's engineers set up, run, and manage Hadoop environments for their customers, allowing customers to focus on their data and applications. When customers' needs change, services are scaled to fit -- one of the core advantages of a cloud-based service.
Customers include MarketShare and Internet Archive.
Competitive Landscape: The HaaS space is heating up. Competition comes from incumbents, such as Amazon Elastic MapReduce (EMR), Microsoft's Hadoop on Azure, and Rackspace's service based on Hortonworks' distribution. Altiscale will also compete directly with Hortonworks and with such startups as Cloudera, Mortar Data, Qubole, and Xplenty.
Key Differentiator: Altiscale argues that they are "the only firm to actually provide a soup-to-nuts Hadoop deployment. By comparison, AWS forces companies to acquire, install, deploy, and manage a Hadoop implementation -- something that takes a lot of time."
Trifacta

What They Do: Provide a platform that enables users to transform raw, complex data into clean and structured formats for analysis.
Headquarters: San Francisco, Calif.
CEO: Joe Hellerstein, who in addition to serving as Trifacta's CEO is also a professor of computer science at Berkeley. In 2010, Fortune included him in its list of the 50 smartest people in technology, and MIT Technology Review included his Bloom language for cloud computing on its TR10 list of the 10 technologies "most likely to change our world."
Funding: Trifacta is backed by $16.3 million in funding raised in two rounds from Accel Partners, XSeed Capital, Data Collective, Greylock Partners, and individual investors.
Why They're on This List: According to Trifacta, there is a bottleneck in the data chain between the technology platforms for Big Data and the tools used to analyze data. Business analysts, data scientists, and IT programmers spend an inordinate amount of time transforming data. Data scientists, for example, spend as much as 60 to 80 percent of their time transforming data. At the same time, business data analysts don't have the technical ability to work with new data sets on their own.
To solve this problem, Trifacta uses "Predictive Interaction" technology to elevate data manipulation into a visual experience, allowing users to quickly and easily identify features of interest or concern. As analysts highlight visual features, Trifacta's predictive algorithms observe both user behavior and properties of the data to anticipate the user's intent and make suggestions without the need for user specification. As a result, the cumbersome task of data transformation becomes a lightweight experience that is far more agile and efficient than traditional approaches. Lockheed Martin and Accretive Health are early customers.
Competitive Landscape: Trifacta will compete with Paxata, Informatica, and Cirro.
Key Differentiator: Trifacta argues that the problem of data transformation requires a radically new interaction model -- one that couples human business insight with machine intelligence. Trifacta's platform combines visual interaction with intelligent inference and "Predictive Interaction" technology to close the gap between people and data.
Splice Machine

What They Do: Provide a Hadoop-based, SQL-compliant database designed for big data applications.
Headquarters: San Francisco, Calif.
CEO: Monte Zweben, who previously worked at the NASA Ames Research Center where he served as the Deputy Branch Chief of the Artificial Intelligence Branch. He later founded and served as CEO of Blue Martini Software.
Funding: Splice Machine is backed by $19 million in funding from InterWest Partners and Mohr Davidow Ventures.
Why They're on This List: Application and Web developers have been moving away from traditional relational databases due to rapidly growing data volumes and evolving data types. New solutions are needed to solve scaling and schema issues. Splice Machine argues that, even a few short months ago, Hadoop, while viewed as a great place to store massive amounts of data, wasn't ready to power applications.
Now, with emerging database solutions, features that made RDBMS so popular for so long, such as ACID compliance, transactional integrity, and standard SQL, are available on top of the cost-effective and scalable Hadoop platform. Splice Machine believes that this enables developers to get the best of both worlds in one general-purpose database platform.
Splice Machine provides all the benefits of NoSQL databases, such as auto-sharding, scalability, fault tolerance, and high availability, while retaining SQL, which is still the industry standard. Splice Machine optimizes complex queries to power real-time OLTP and OLAP applications at scale without rewriting existing SQL-based apps and BI tool integrations. By leveraging distributed computing, Splice Machine can scale from terabytes to petabytes by simply adding more commodity servers. Splice Machine is able to provide this scalability without sacrificing the SQL functionality or the ACID compliance that are cornerstones of an RDBMS.
Competitive Landscape: Competitors include Cloudera, MemSQL, NuoDB, Datastax, and VoltDB.
Key Differentiator: Splice Machine claims to have the only transactional SQL-on-Hadoop database that powers real-time big data applications.
DataTorrent

What They Do: Provide a real-time stream processing platform built on Hadoop.
Headquarters: Santa Clara, Calif.
CEO: Phu Hoang, who was previously a founding member of the engineering team at Yahoo, where he served as executive vice president of engineering.
Funding: The company closed an $8 million Series A round in June 2013. August Capital led the round and was joined by AME Cloud Ventures. The company previously secured $750K in seed funding from Morado Ventures and Farzad Nazem.
Why They're on This List: DataTorrent argues that latency will soon be a central concern in Big Data solutions. DataTorrent points out that "data is happening now, streaming-in from various sources -- in real-time, all the time." Many organizations struggle to process, analyze, and act on this never-ending, ever-growing stream of information.
For some insights, by the time data is stored to disk, analyzed, and responded to -- it's already too late. For instance, if a hacker compromises a credit card account and manages to make a few purchases, plenty of damage has already been done, even if that account is cut off within minutes. DataTorrent contends that an organization's ability to recognize and react to events instantaneously isn't just a business advantage. In today's world, it is a necessity.
Unlike traditional batch processing, which can take hours, DataTorrent claims to be able to process hundreds of millions of data items per second. This enables organizations to process, monitor, and make decisions based on their data in real time.