Review: Spark lights a fire under big-data processing

Apache Spark brings high-speed, in-memory analytics to Hadoop clusters, crunching large-scale data sets in minutes instead of hours.


Apache Spark got its start in 2009 at UC Berkeley’s AMPLab as a way to perform in-memory analytics on large data sets. At that time, Hadoop MapReduce was focused on large-scale data pipelines that were not iterative in nature. Building analytic models on MapReduce in 2009 was a very slow process, so AMPLab designed Spark to help developers perform interactive analysis of large data sets and to run iterative workloads, such as machine-learning algorithms, that repeatedly process the same data sets in RAM.

Spark doesn’t replace Hadoop. Rather, it offers an alternative processing engine for workloads that are highly iterative. By avoiding costly writes to disk between processing steps, Spark jobs can run an order of magnitude or more faster than their Hadoop MapReduce equivalents. By "living" inside the Hadoop cluster, Spark uses the Hadoop data layer (HDFS, HBase, and so on) for the endpoints of the data pipeline, reading raw data and storing final results.
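To make the difference concrete, here is a minimal sketch in plain Python of why keeping a data set in RAM helps an iterative job. It is a toy model, not Spark or Hadoop code: the DiskDataset class, both fit_mean functions, and the learning rate are hypothetical names and values chosen for illustration. The loop estimates the mean of a small data set by gradient descent, once re-reading the input on every pass (roughly how a chain of MapReduce jobs behaves) and once reading it a single time and iterating in memory (the pattern Spark is built around).

```python
import os
import tempfile


class DiskDataset:
    """Toy stand-in for a data set stored on disk (hypothetical, not a Spark API)."""

    def __init__(self, values):
        fd, self.path = tempfile.mkstemp(suffix=".txt")
        with os.fdopen(fd, "w") as f:
            f.write("\n".join(str(v) for v in values))
        self.reads = 0  # how many times the file has been read back

    def load(self):
        """Read the whole data set from disk and count the read."""
        self.reads += 1
        with open(self.path) as f:
            return [float(line) for line in f if line.strip()]

    def close(self):
        os.remove(self.path)


def step(estimate, data, lr):
    """One gradient-descent step toward the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - lr * gradient


def fit_mean_from_disk(dataset, iterations, lr=0.5):
    """Every pass re-reads the input from disk."""
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, dataset.load(), lr)
    return estimate


def fit_mean_in_memory(dataset, iterations, lr=0.5):
    """Read the input once, then iterate over the copy held in RAM."""
    data = dataset.load()
    estimate = 0.0
    for _ in range(iterations):
        estimate = step(estimate, data, lr)
    return estimate


if __name__ == "__main__":
    ds = DiskDataset([1.0, 2.0, 3.0, 4.0])
    from_disk = fit_mean_from_disk(ds, iterations=20)
    disk_reads = ds.reads
    ds.reads = 0
    in_memory = fit_mean_in_memory(ds, iterations=20)
    print(f"disk passes: {disk_reads} reads, in-memory: {ds.reads} read")
    print(f"estimates: {from_disk:.4f} vs {in_memory:.4f}")
    ds.close()
```

Both versions arrive at the same estimate; the only difference is how often the input is fetched from storage. On a real cluster each of those fetches would also involve network and serialization overhead, which this sketch does not model.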
