
The Internet Archive's Wayback Machine gets a new data center

It's housed in a 20-foot-long metal shipping container

March 25, 2009 12:00 PM ET

Computerworld - The Internet Archive today announced that it has a new computer behind its library of 151 billion archived Web pages. The machine fits in a 20-foot-long outdoor metal cargo container filled with 63 server clusters that offer 4.5 million gigabytes of data storage capacity and 1TB of memory.

The Internet Archive has been taking a snapshot of the World Wide Web every two months since 1997, and the images are made available through the Wayback Machine, a Web site that gets about 200,000 visitors a day, or about 500 hits per second, on the 4.5-petabyte database.

"It may be the single largest database in the world, and it's all in a shipping container. I think of the shipping container as a single machine or expression made up of many smaller machines," said Brewster Kahle, digital librarian and co-founder of the Internet Archive, the nonprofit organization that runs the Wayback Machine site.

For the past 13 years, the Internet Archive has been growing rapidly, most recently by about 100TB of data per month. Until last year, the site had been using a more traditional data center filled with 800 standard Linux servers, each with four hard drives. The new Sun Modular Datacenter that powers it now is on Sun's campus in Santa Clara, Calif., and houses eight racks filled with 63 Sun Fire x4500 servers with dual- or quad-core x86 processors running Solaris 10 with ZFS. Each Sun server is combined with an array of 48 1TB hard drives. The server unit is referred to as a "Thumper."

[Image: The Sun Modular Datacenter, which houses 63 Sun Fire x4500 servers running Solaris 10 with ZFS. Each server has 48TB of capacity.]
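
The hardware figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the article (63 servers, 48 drives of 1TB each, and the earlier 800 servers with four drives each); the variable names are illustrative, and the result is raw disk capacity only. The article does not break down how the separately quoted 4.5-petabyte figure is reached, so the sketch does not try to reproduce it.

```python
# Figures as stated in the article; the names are illustrative only.
NEW_SERVERS = 63
DRIVES_PER_NEW_SERVER = 48
TB_PER_DRIVE = 1

OLD_SERVERS = 800
DRIVES_PER_OLD_SERVER = 4

# Raw disk capacity of the container, before any ZFS redundancy or overhead.
new_drive_count = NEW_SERVERS * DRIVES_PER_NEW_SERVER
raw_capacity_tb = new_drive_count * TB_PER_DRIVE

# Number of drives in the previous data center, for comparison.
old_drive_count = OLD_SERVERS * DRIVES_PER_OLD_SERVER

print(f"New container: {new_drive_count} drives, {raw_capacity_tb} TB raw")
print(f"Old data center: {old_drive_count} drives across {OLD_SERVERS} servers")
```

On these figures the container holds roughly the same number of drives as the old data center, packed into 63 machines instead of 800.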

"The only thing needed besides [the shipping container] are the network connections, a chilled water supply and electricity," said Dave Douglas, Sun's chief sustainability officer. "Customers using this tend to be people running out of data center space and need something quickly or need a data center in a remote area where mobility is key."

The nonprofit Internet Archive, which is based in the Presidio in San Francisco, uses an algorithm that repeats a Web crawl every two months in order to add new Web page images to its database. The algorithm first performs a broad crawl that starts with a few "seed sites," such as Yahoo's directory. After snapping a shot of the home page, it then moves to any referable pages within the site until there are no more pages to capture. If there are any links on those pages, the algorithm automatically opens them and archives that content as well.
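
The crawl described above is, in outline, a breadth-first traversal from a set of seed pages. The following is a minimal sketch of that general idea, not the Internet Archive's actual crawler: the function broad_crawl, the fetch callback and the toy in-memory "web" are all assumptions made for illustration, and a real crawler would also fetch over the network, honor robots.txt and store each snapshot.

```python
from collections import deque
from typing import Callable, Iterable, List


def broad_crawl(seeds: Iterable[str], fetch: Callable[[str], List[str]]) -> List[str]:
    """Visit every page reachable from the seeds exactly once, breadth-first.

    `fetch` is a hypothetical callback that snapshots a page and returns
    the links found on it.
    """
    archived: List[str] = []
    seen = set()
    queue = deque()

    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)

    while queue:
        url = queue.popleft()
        links = fetch(url)       # snapshot the page, collect its links
        archived.append(url)
        for link in links:
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                queue.append(link)

    return archived


# Toy stand-in for the Web: each page maps to the links it contains.
TOY_WEB = {
    "seed.example/": ["seed.example/a", "seed.example/b"],
    "seed.example/a": ["seed.example/b", "other.example/"],
    "seed.example/b": ["seed.example/"],
    "other.example/": [],
}


def toy_fetch(url: str) -> List[str]:
    return TOY_WEB.get(url, [])


order = broad_crawl(["seed.example/"], toy_fetch)
print(order)
```

The set of already-seen pages is what lets the crawl stop when "there are no more pages to capture": pages that link back to each other are visited only once.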

Previously, a typical Web crawl was supported by 10 or 20 clustered Linux servers, Kahle said. The new crawls are supported by the entire data center, as all 63 Sun Fire servers act as a single machine.

[Image: The Thumper. Each server has 48 1TB hard drives.]

In addition to Web pages, the Archive also keeps software, books and a moving image collection that has 150,000 items in 100 different subcollections, as well as audio clips -- to the tune of 200,000 items in over 100 collections.

"We see this scale of machine, and the idea of putting machines outdoors is a potential long-term trend for organizations like us," Kahle said.

The Internet Archive also works with about 100 physical libraries around the world whose curators help guide deep Internet crawls. The Internet Archive's massive database is mirrored to the Bibliotheca Alexandrina, the new Library of Alexandria in Egypt, for disaster recovery purposes.

[Image: The Internet Archive's previous data center. In 2008, the Internet Archive used 800 servers, each with four hard drives.]
