Impact of the SQL-Hadoop marriage on infrastructure

With Hadapt, Cloudera, Greenplum and others offering SQL overlays for Hadoop, the unification of SQL and Hadoop is here to stay. Regardless of what the naysayers say, the benefits of this marriage are clear - it enables businesses to feed all of their structured, semi-structured and unstructured data into a single platform and analyse it there. So what does this mean from an infrastructure perspective? Will it be business as usual?

I recently spoke with Justin Borgman, the CEO of Hadapt. Hadapt has developed a solution that unifies SQL and Hadoop, which it calls the Adaptive Analytics platform. In a nutshell, it brings native SQL to Hadoop.

Traditionally, Hadoop has been associated with NoSQL databases. Hadapt's platform combines Hadoop with a relational data store, and the resulting platform can be used to perform SQL-based analytics on data that resides in Hadoop - no more connectors, bridges or converters. Hadapt says that its solution is Hadoop distribution agnostic and can integrate seamlessly with distributions from Cloudera, MapR, Hortonworks and even Intel.

Hadapt is one of several vendors trying to ride the paradigm shift from appliance-based computing (on which most traditional relational-database analytics solutions are built) to distributed computing on clusters of inexpensive commodity hardware. Along with Cloudera (Project Impala), Apache Hive, EMC Greenplum and others, Hadapt is making the latest advances in RDBMS technology available on Hadoop.

At IDC Directions, I moderated a panel on Big Data. The theme of the panel was the disruptive impact of Big Data on infrastructure. When the topic of Hadoop came up during the discussion amongst panellists, the focus quickly shifted from the "wow" to the "challenges", and to what these mean for businesses as they increase their dependency on Hadoop:

  • It is widely reported, and considered a given, that Hadoop has become a network and compute problem. One of the prime contributors is the bandwidth consumed in converting data and feeding it into Hadoop. Getting data out of relational databases, converting it and then loading it into Hadoop is disruptive and time-consuming.
  • At the same time, as businesses move from search to discovery, they need a holistic yet unified approach to analytics - one that examines all data sets, not just unstructured or semi-structured ones but structured data sets as well. It is also more efficient from an infrastructure perspective to run analytics algorithms where the data is centrally located, which reduces both the time to analyse and the drag on human capital.
  • It is a little-known fact that Hadoop can be used as persistent storage for all data types, including structured data from relational databases. The ability to use Hadoop for long-term archival of structured data from databases, combined with consistent SQL front-end access to this persistent data store, could reduce network bandwidth and compute cycles.
  • Finally, using Hadoop as the persistent data store means that businesses will have an alternative to expensive enterprise storage for archived data. They can keep archived data on commodity hardware and, what's more, run analytics on it the same way they are used to doing with traditional data warehouses.
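To make the last two points concrete, the sketch below shows the kind of HiveQL a business might use to put a SQL front end on data that already sits in Hadoop (Apache Hive being one of the SQL-on-Hadoop options named above). The table name, columns and HDFS path are hypothetical examples, and the statements are wrapped in Python strings only so the snippet is self-contained; this is an illustration of the pattern, not any vendor's specific product.

```python
# Illustrative HiveQL for querying archived data that already lives in HDFS.
# Table name, columns and the HDFS path are hypothetical.
create_table = """
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
    ts        STRING,
    user_id   STRING,
    url       STRING,
    bytes     BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
LOCATION '/data/archive/web_logs';
"""

# An ordinary SQL aggregate then runs where the data lives -- no export,
# conversion or reload into a separate warehouse is needed.
query = """
SELECT user_id, SUM(bytes) AS total_bytes
FROM web_logs
GROUP BY user_id
ORDER BY total_bytes DESC
LIMIT 10;
"""

if __name__ == "__main__":
    print(create_table)
    print(query)
```

Because the table is declared EXTERNAL over an existing HDFS location, Hive does not copy or convert the files; the SQL layer is simply laid over data that is already in place, which is exactly the bandwidth saving the bullets above describe.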

With the rising popularity of solutions from Hadapt, Cloudera and others, businesses will be able to minimise the overhead of shipping data around for analytics, consolidate all of their data - structured, semi-structured and unstructured - in one place, and push the envelope on Hadoop utilisation. Expect Oracle - and others - to fight back.

Posted by Ashish Nadkarni

Copyright © 2013 IDG Communications, Inc.
