
Google claims MapReduce sets data-sorting record, topping Yahoo, conventional databases

By Eric Lai
November 24, 2008 12:00 PM ET

Computerworld - Google Inc. late last week said that the results of in-house data-sorting tests bolster its claim that its MapReduce technology can manipulate more data faster than any conventional database.

According to a Friday afternoon blog post by Grzegorz Czajkowski, a member of Google's systems infrastructure team, MapReduce recently sorted 1 terabyte of data in 68 seconds, or about a third of the time Yahoo Inc. achieved this summer.

Sorting or rearranging data is one of the most basic functions of a spreadsheet, database or other data-manipulation software.

Google used 1,000 servers running MapReduce in parallel to sort the data, versus 910 for Yahoo, according to Czajkowski.
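For readers unfamiliar with how a sort can be spread across many machines, here is a minimal single-process sketch of one common approach, a range-partitioned sort. It is an illustration only: Google's post, as reported here, does not describe its implementation, and the function name `partition_sort` and its parameters are hypothetical. Each "worker" is simulated by a Python list rather than a real server.

```python
import bisect
import random


def partition_sort(records, num_workers, sample_size=100, seed=0):
    """Toy range-partitioned sort; each bucket stands in for one server."""
    if not records:
        return []

    # Sample the input to estimate where the key ranges should be split.
    rng = random.Random(seed)
    sample = sorted(rng.sample(records, min(sample_size, len(records))))

    # Pick num_workers - 1 splitter keys from the sorted sample.
    splitters = [
        sample[(i * len(sample)) // num_workers] for i in range(1, num_workers)
    ]

    # "Map" step: route every record to the bucket that owns its key range.
    buckets = [[] for _ in range(num_workers)]
    for record in records:
        buckets[bisect.bisect_right(splitters, record)].append(record)

    # "Reduce" step: each bucket is sorted independently. On a real cluster
    # these sorts would run in parallel on separate machines.
    sorted_buckets = [sorted(bucket) for bucket in buckets]

    # Bucket k only holds keys below those in bucket k + 1, so simple
    # concatenation yields a globally sorted result.
    return [record for bucket in sorted_buckets for record in bucket]


if __name__ == "__main__":
    data = [random.randint(0, 10**6) for _ in range(10_000)]
    assert partition_sort(data, num_workers=8) == sorted(data)
```

The point of the design is that once records are routed by key range, no worker needs to see another worker's data, which is what lets the job scale to many servers.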

Google also tested MapReduce's ability to sort 1 petabyte, or 1,000 TB, of data. That is equivalent to 12 times the amount of archived Web data in the U.S. Library of Congress as of May 2008, according to Google.

Using 4,000 servers, which is likely a small fraction of Google's entire worldwide server infrastructure, MapReduce took 6 hours, 2 minutes to sort 1 PB, according to Czajkowski.
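To put the two results on a common footing, the reported figures can be converted into rough sort throughput. The sketch below uses only the numbers cited above and assumes decimal units (1 TB = 10^12 bytes, consistent with 1 PB = 1,000 TB); the helper name `throughput` is hypothetical, and the per-server figure is a simple average that ignores how the work was actually distributed.

```python
TB = 10**12          # assumption: decimal terabyte
PB = 1000 * TB       # 1 PB = 1,000 TB, as stated in the article


def throughput(total_bytes, seconds, servers):
    """Return (aggregate bytes/s, average bytes/s per server)."""
    aggregate = total_bytes / seconds
    return aggregate, aggregate / servers


runs = {
    "1 TB on 1,000 servers": (1 * TB, 68, 1000),
    "1 PB on 4,000 servers": (1 * PB, 6 * 3600 + 2 * 60, 4000),
}

for label, (size, secs, servers) in runs.items():
    agg, per_server = throughput(size, secs, servers)
    print(
        f"{label}: {agg / 1e9:.1f} GB/s aggregate, "
        f"{per_server / 1e6:.1f} MB/s per server"
    )
```

Under these assumptions, the 1 TB run works out to roughly 14.7 GB per second across the cluster, and the 1 PB run to roughly 46.0 GB per second, or about 14.7 MB and 11.5 MB per second per server, respectively.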

"We're not aware of any other sorting experiment at this scale and are obviously very excited to be able to process so much data so quickly," he wrote.

Czajkowski did not say when the tests were done. He did reveal that as of early January this year, Google was processing an average of 20 PB total per day.

By comparison, the largest publicly known data warehouses today store several petabytes of data total, only processing a tiny fraction of that amount each day.

Google's announcement appeared to be deliberately timed to coincide with a speech by noted database expert and MapReduce critic David DeWitt.

A former computer science professor at the University of Wisconsin, Madison, DeWitt joined Microsoft this spring to run a new research lab being created on the Madison campus.

The lab will focus on helping Microsoft's SQL Server "scale out" in order to run on hundreds or thousands of servers at a time. That will allow customers to run parallel database clusters that are technically similar to Google's, though nowhere near the latter's scale.

Early this year, DeWitt, along with database industry legend Michael Stonebraker, co-wrote a blog post arguing that MapReduce was a "sub-optimal ... not novel" type of database that lacked many features that modern database administrators and developers take for granted and that was unworthy of the hype it has received.

In an interview last week with Computerworld, DeWitt praised MapReduce's scalability and hardiness.

But DeWitt also stood firm on MapReduce's shortcomings. He and Stonebraker are also submitting a paper to the Association for Computing Machinery (ACM) that compares the performance of several databases, including IBM's DB2 and Stonebraker's Vertica, with MapReduce and another similar nonrelational data engine, Apache Hadoop. That paper may be publicly available as early as late January, said DeWitt.

DeWitt gave a keynote speech on Friday at the Professional Association for SQL Server (PASS) conference in Seattle.

He did not directly criticize MapReduce during his PASS keynote speech, according to blog reports.
