Researchers: Databases still beat Google's MapReduce
Speed, efficiency of parallel SQL databases superior, paper shows
Computerworld - A team of researchers will release on Tuesday a paper showing that parallel SQL databases perform up to 6.5 times faster than Google Inc.'s MapReduce data-crunching technology.
Google bypassed parallel databases and invented MapReduce as a way to index the World Wide Web on its global grid of low-end PC servers. As of January 2008, Google was using MapReduce to process 20 petabytes of data a day.
In results of in-house tests published last November, Google used MapReduce running on 1,000 servers to sort 1TB of data in just 68 seconds.
Such results have won MapReduce and its open-source version, Hadoop, many fans, who argue that the technology is already superior to the 40-year-old relational model for large-scale grids, such as cloud-computing infrastructures, and will eventually render databases obsolete for other tasks.
Microsoft technical fellow David DeWitt and Michael Stonebraker, a database industry legend and chief technology officer at Vertica Systems Inc., who co-authored the paper, have previously argued that MapReduce lacks many key features already standard to databases and was generally a "major step backward."
The paper, titled "A Comparison of Approaches to Large-Scale Data Analysis," is sure to stoke heated discussion among data junkies over the technical merits of each approach. It will be published by the Association for Computing Machinery (ACM), a 92,000-member IT society, in the June 29-July 2 issue of its SIGMOD Record journal of data management.
In addition to DeWitt and Stonebraker, five researchers from Brown University, Yale University, MIT and the University of Wisconsin co-authored the report.
In the paper, DeWitt and Stonebraker put meat on their argument by testing two 100-node parallel, "shared-nothing" database clusters, one running the column-based Vertica and another running a row-based database from "a major relational vendor," against a similarly configured MapReduce cluster of the same size. Each server had a 2.4-GHz Intel Core 2 Duo processor, 4GB of RAM and two 250GB SATA-I hard drives, and ran 64-bit Red Hat Enterprise Linux. The servers were all connected by Gigabit Ethernet.
Their conclusion? Databases "were significantly faster and required less code to implement each task, but took longer to tune and load the data," the researchers wrote. Database clusters were between 3.1 and 6.5 times faster on a "variety of analytic tasks."
MapReduce also requires developers to write features or perform tasks manually that can be done automatically by most SQL databases, they wrote.
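To make the "less code" point concrete, here is a minimal, single-machine sketch in Python. It is not the researchers' benchmark and does not use Hadoop or any real MapReduce API. The table name, column names and sample rows are hypothetical, invented only for illustration. The first function lets a SQL engine (Python's built-in sqlite3 module) do the grouping and summing. The second spells out the map, shuffle and reduce steps by hand. In a real MapReduce framework the shuffle would be supplied by the framework, but the developer would still write the map and reduce logic.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (source_ip, ad_revenue). Invented for illustration.
VISITS = [("10.0.0.1", 5), ("10.0.0.2", 3), ("10.0.0.1", 7), ("10.0.0.3", 1)]


def revenue_by_ip_sql(rows):
    """Declarative version: one query, the database does the grouping."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (source_ip TEXT, ad_revenue INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    cursor = conn.execute(
        "SELECT source_ip, SUM(ad_revenue) FROM visits GROUP BY source_ip"
    )
    result = dict(cursor)  # each row is a (source_ip, total) pair
    conn.close()
    return result


def map_phase(rows):
    """Map step: emit one (key, value) pair per input record."""
    for source_ip, ad_revenue in rows:
        yield source_ip, ad_revenue


def shuffle(pairs):
    """Stand-in for the framework's shuffle: group values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def revenue_by_ip_mapreduce(rows):
    """Hand-written version: the same aggregation as explicit steps."""
    return reduce_phase(shuffle(map_phase(rows)))


if __name__ == "__main__":
    print(revenue_by_ip_sql(VISITS))
    print(revenue_by_ip_mapreduce(VISITS))
```

Both functions return the same totals per address. The sketch says nothing about speed, since neither version runs in parallel here.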
MapReduce may be "well suited for development environments with a small number of programmers and a limited application domain," they said. "This lack of constraints, however, may not be appropriate for longer-term and larger-sized projects."