Cascading 2.0 works to ease MapReduce pain

An alternate API for Hadoop applications

The difficulty of creating MapReduce jobs to query Hadoop-stored datasets can be circumvented more easily with today's release of Cascading 2.0, an application framework that enables Java developers to quickly and easily build Big Data applications on Apache Hadoop--without the MapReduce mess.

To give you an idea about the import of this news, ask someone in the data sector if they like programming MapReduce jobs within Hadoop.

Very likely they will emphatically tell you "no," and if they tell you yes, they love it, perhaps you should make no sudden moves and back away slowly.

That's because MapReduce, while powerful, is almost universally regarded as a complicated and difficult framework to use, even for Java developers. Cascading is an alternative API for working with Hadoop that is meant to deliver the same results in an easier fashion, according to Concurrent Inc. CEO Chris Wensel.

While Cascading is a stand-alone open source project licensed under the Apache Software License (ASL) 2.0, Concurrent serves as its primary commercial sponsor. The company has been around since 2008, and has been quietly deploying earlier versions of Cascading to big-name customers like Twitter and Etsy.

It is easy to compare Cascading to interface tools like Pig, which enables ad hoc analytics on a Hadoop dataset, or Hive, which enables the creation of smaller, more easily managed datasets from within Hadoop. And indeed, Cascading sort of fits in that category. But there are key differences.

Cascading, for instance, enables a more flexible scheduling system than the native Hadoop system, Wensel explained to me in a conversation this week. Hadoop's default scheduler is a first-in-first-out type of system, which can leave you stranded if you have a little MapReduce job punched in right behind a massive set of MapReduce jobs.
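To make the first-in-first-out problem concrete, here is a minimal Java sketch of a deliberately simplified queue that runs one job at a time in submission order. It is not Hadoop's scheduler code and it does not use Cascading. The class name, method name, and job durations are hypothetical, chosen only for illustration.

```java
import java.util.Arrays;

public class FifoQueueSketch {
    // Given job run times (in minutes) in submission order, return how long
    // each job waits before it starts. Simplifying assumption: one job runs
    // at a time, strictly first in, first out.
    public static long[] waitTimes(long[] runMinutes) {
        long[] waits = new long[runMinutes.length];
        long clock = 0;
        for (int i = 0; i < runMinutes.length; i++) {
            waits[i] = clock;        // job i starts when all earlier jobs finish
            clock += runMinutes[i];
        }
        return waits;
    }

    public static void main(String[] args) {
        // Hypothetical workload: two large jobs, then a two-minute job.
        long[] runMinutes = {240, 180, 2};
        System.out.println(Arrays.toString(waitTimes(runMinutes)));
        // prints [0, 240, 420]
    }
}
```

Under these assumptions, the two-minute job waits seven hours before it starts, which is the "stranded" situation described above.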

Cascading 2.0 also lets developers detach their Hadoop applications, meaning that they can run in memory and be tested on smaller datasets. So, application developers can more quickly build and test applications on their desktops in their language of choice (Java, Jython, Scala, Clojure or JRuby) with familiar constructs and reusable components, and then deploy them onto their production Hadoop cluster.
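The general idea of keeping processing logic separate from where the data lives can be sketched in plain Java. The example below is not Cascading's API and has no Hadoop dependency. The class name, method name, and sample lines are hypothetical. It only shows why logic that depends on a stream of records, rather than on a cluster, is easy to exercise in memory on a small dataset.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DetachedWordCount {
    // The counting logic sees only an Iterable of lines. It does not know
    // whether they come from an in-memory list or from a much larger store.
    public static Map<String, Integer> countWords(Iterable<String> lines) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable output
        for (String line : lines) {
            for (String token : line.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue; // split can yield an empty leading token
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // A tiny in-memory dataset stands in for production data.
        List<String> sample = Arrays.asList(
            "Hadoop stores data",
            "Cascading reads Hadoop data");
        System.out.println(countWords(sample));
        // prints {cascading=1, data=2, hadoop=2, reads=1, stores=1}
    }
}
```

Because the method takes an `Iterable`, the same logic could in principle be fed from any source that can be read line by line. How Cascading itself binds such logic to a Hadoop cluster is not shown here.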

This "detachability" has further-reaching implications: by componentizing Hadoop-related jobs and applications, businesses using Cascading can now get a clearer picture of the time and effort put into working with Hadoop. This, Wensel said, is hard to do when working with hundreds of MapReduce jobs at any given time.

It was important for Concurrent to solve this particular problem, because increasingly Wensel sees a need to be able to unwind Hadoop application development from the development process so it can be properly costed out and managed. If there's a problem, such as tainted data, knowing what apps touched the data and when is critical, and something that's very difficult to discover with MapReduce jobs.

Wensel was very emphatic on how Cascading (and Concurrent with it) was positioned within the Hadoop ecosystem.

"We're not competing with Cloudera," Wensel said. "We're not managing Hadoop."

The best analogy Wensel drew for describing what Cascading does for Hadoop is likening it to services mega-company SAP.

"What we're doing for Hadoop is the same thing that SAP is doing for SQL," he explained.

Given the near-universal headaches that MapReduce seems to cause, I had to ask Wensel: why aren't people using Cascading exclusively to deal with Hadoop?

"Every once in a while you run into a bottleneck when working with abstractions," Wensel replied. "Sometimes you just have to write in assembly language. That's not the best analogy, but it's close to why MapReduce can't be dismissed.

"Besides, you can't really get around it," Wensel added. "In order to completely understand Hadoop, you have to understand MapReduce."

Cascading 2.0 is generally available today, free of charge under the ASL.

Read more of Brian Proffitt's Zettatag and Open for Discussion blogs and follow the latest IT news at ITworld. Drop Brian a line or follow Brian on Twitter at @TheTechScribe. For the latest IT news, analysis and how-tos, follow ITworld on Twitter and Facebook.

This story, "Cascading 2.0 works to ease MapReduce pain" was originally published by ITworld.

Copyright © 2012 IDG Communications, Inc.
