Big data SMAQ-down

The term "big data," is getting thrown around a lot these days, and in certain circles it is threatening to overtake "cloud" as the most overused and misused term in IT.

Interestingly, some of the large, traditional storage vendors are embracing the term big data, using it as an umbrella term for all large collections of data -- and hence for all of their offerings. A more nuanced understanding of big data actually shows it to be the antithesis of both the technology and the business models of the traditional storage vendors.

Let's start with a few definitions before diving in. An emerging consensus is that big data does not refer just to large amounts of data, but specifically to "a collection of data that becomes large enough that it cannot be processed using conventional methods." What are those "conventional methods"? To answer that, it is helpful to look at the "conventional problems" those methods were designed to solve.

For most of enterprise IT's history, the big problem was ensuring that transactional systems (e.g. online transaction processing) ran smoothly, quickly and precisely. As I mentioned in a previous post, this drove an approach to IT built on proprietary relational databases running on proprietary, monolithic servers with proprietary, monolithic storage.

The traditional IT stack that worked so well for a relatively small amount of highly valuable and highly structured data began to collapse when faced with a number of challenges. For example, the emergence of web-scale applications (and web-scale user bases) drove the need for an approach (i.e. the LAMP stack) that distributed computing and serving across large numbers of commodity servers. Similarly, the explosive growth in unstructured data (video, audio, medical images, seismic data and the like) has led to a need for similar commoditization of the storage hardware.

To some extent, big data represents a set of challenges to virtually the entire conventional IT stack -- database, compute and storage. This is why a new stack -- Storage, MapReduce and Query (hence SMAQ) -- is needed. And, just as the LAMP stack transformed IT, I believe that the big data SMAQ stack will transform IT yet again.

Again, understanding the "problem" is helpful here. Imagine, for example, that you are not only trying to store and serve billions of documents, but are also trying to perform sophisticated analytics on those files, such as tabulating the frequency of words within those documents or analyzing the pattern of linkages between those documents. If this sounds like the sort of problem that a company like Google or Yahoo would face, that is no accident. Many of the exciting technologies around big data (Hadoop, NoSQL, etc.) came from projects developed inside the large web companies to address these challenges.

Or suppose that you are not only storing millions of files relating to weather data but are also trying to analyze that data for climate change patterns. The problem is not just storing lots of data, but leveraging that data for new and (hopefully) profound insights into patterns and trends.

Conventional databases are not up to the tasks described above. The design constraints that make relational databases so good at problems like maintaining transaction history (the ACID properties) place limits on how far those databases can scale. Fortunately, the kind of analysis described above generally has less of a need for absolute precision.

Similarly, conventional storage and computing are not up to the task. The kinds of analyses described above are best performed by distributing the data across large numbers of commodity storage devices and distributing the computation across large numbers of compute devices. As you might imagine, the input data is first processed, item by item, on all of these distributed devices and transformed into an intermediate data set. These intermediate results are then reduced to a summarized data set, which is the desired end result. These two phases, called Map and Reduce, make up the "M" in the SMAQ acronym.
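
To make those two phases concrete, here is a minimal, purely local word-count sketch in Python. It is only an illustration -- the document contents, function names and the in-memory "shuffle" step are invented for the example and are not taken from any particular framework. On a real cluster (e.g. Hadoop), the same map and reduce functions would run in parallel across many nodes, with the framework handling the grouping of intermediate results.

    # Minimal local sketch of the Map and Reduce phases described above.
    # On a real cluster the two functions below would run in parallel on
    # many nodes; a tiny in-memory "shuffle" stands in for the framework.
    from itertools import groupby
    from operator import itemgetter

    def map_phase(doc_id, text):
        """Map: process one input item, emit intermediate (key, value) pairs."""
        for word in text.lower().split():
            yield (word, 1)

    def reduce_phase(word, counts):
        """Reduce: summarize all intermediate values that share a key."""
        return word, sum(counts)

    def run_job(documents):
        # 1. Map every input item (on a cluster, this runs where the data lives).
        intermediate = []
        for doc_id, text in documents.items():
            intermediate.extend(map_phase(doc_id, text))
        # 2. Shuffle: group the intermediate pairs by key.
        intermediate.sort(key=itemgetter(0))
        # 3. Reduce each group to the summarized end result.
        return dict(reduce_phase(word, (count for _, count in pairs))
                    for word, pairs in groupby(intermediate, key=itemgetter(0)))

    if __name__ == "__main__":
        docs = {"doc1": "big data needs a new stack",
                "doc2": "the smaq stack is a new stack for big data"}
        print(run_job(docs))  # e.g. {'a': 2, 'big': 2, 'data': 2, ..., 'stack': 3}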


We've recently seen an explosion in the available technologies for the M, A and Q parts of the acronym. For open source enthusiasts like me, the most exciting activity centers on the Hadoop ecosystem and its menagerie of projects like Hive, Pig, et al.

As is typically the case, storage needs to catch up with the rest of the IT stack. The data set itself (almost by definition) needs to be highly distributed. Both the data and the computing for big data will happen on large numbers of heterogeneous, distributed and far-flung devices.

Furthermore, since it is generally easier to move the computing to the data than to move the data to the computing, the storage part of the SMAQ stack needs to ensure that all that unstructured and semi-structured data is efficiently and safely distributed to all of the compute nodes, in a manner that both scales and delivers sufficiently high performance. This means that big data storage must:

a) Work on large numbers of heterogeneous commodity devices distributed across the broader Internet

b) Deliver the kind of performance expected for the intensive number crunching associated with analysis

c) Avoid design mistakes like centralized metadata stores and the 16 TB volume size limits imposed by most legacy systems

d) Allow the computing and storage functions to happen on the same hardware. As mentioned above, it is generally much less expensive to move the computing to the storage than vice versa (a minimal scheduling sketch follows this list), but that's hard to do if the storage is locked down and proprietary.

e) Scale to petabytes and even exabytes
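
On point (d), here is a rough, purely illustrative sketch of what "moving the computing to the data" means in practice: a scheduler assigns each task to a node that already holds a replica of the task's input block, and only falls back to a remote read when no such node is free. The block layout, node names and function are made up for the example; real schedulers (e.g. Hadoop's) are considerably more sophisticated.

    # Illustrative only: prefer running a task where its data already lives.
    def schedule_tasks(block_replicas, free_nodes):
        """block_replicas: {block_id: [nodes holding a replica of that block]}
           free_nodes:     set of nodes with spare compute capacity."""
        assignments = {}
        for block, replicas in block_replicas.items():
            # Data-local placement: move the computing to the storage.
            local = [node for node in replicas if node in free_nodes]
            if local:
                assignments[block] = local[0]
            elif free_nodes:
                # No free node holds the data: it must travel over the network.
                assignments[block] = next(iter(free_nodes))
            else:
                assignments[block] = None  # wait until capacity frees up
        return assignments

    if __name__ == "__main__":
        replicas = {"blk-1": ["node-a", "node-c"],
                    "blk-2": ["node-b", "node-d"],
                    "blk-3": ["node-a", "node-b"]}
        print(schedule_tasks(replicas, free_nodes={"node-a", "node-b"}))
        # blk-1 -> node-a, blk-2 -> node-b, blk-3 -> node-a (all data-local)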

My conclusion: the proprietary and monolithic approach to storage simply won't work for big data. But as this big data SMAQ-down progresses, I'm sure we'll see that the transformation in IT and storage will unlock value, both in the insights gained from big data and in the economics of storing and managing that data.

 Ben Golub is President and CEO at Gluster. He is on Twitter @golubbe.  

