
The Internet Archive's Wayback Machine gets a new data center

It's housed in a 20-foot-long metal shipping container

By Lucas Mearian
March 25, 2009 12:00 PM ET

Computerworld - The Internet Archive today announced that it has a new computer behind its library of 151 billion archived Web pages. The machine fits in a 20-foot-long outdoor metal cargo container filled with 63 servers that offer 4.5 million gigabytes of data storage capacity and 1TB of memory.

The Internet Archive has been taking a snapshot of the World Wide Web every two months since 1997, and the images are made available through the Wayback Machine, a Web site that gets about 200,000 visitors a day, or about 500 hits per second, on the 4.5-petabyte database.

"It may be the single largest database in the world, and it's all in a shipping container. I think of the shipping container as a single machine or expression made up of many smaller machines," said Brewster Kahle, digital librarian and co-founder of the Internet Archive, the nonprofit organization that runs the Wayback Machine site.

For the past 13 years, the Internet Archive has been growing rapidly, most recently by about 100TB of data per month. Until last year, the site had been using a more traditional data center filled with 800 standard Linux servers, each with four hard drives. The new Sun Modular Datacenter that powers it now is on Sun's campus in Santa Clara, Calif., and houses eight racks filled with 63 Sun Fire x4500 servers with dual- or quad-core x86 processors running Solaris 10 with ZFS. Each Sun server is combined with an array of 48 1TB hard drives. The server unit is referred to as a "Thumper."

The Sun Modular Datacenter houses 63 Sun Fire x4500 servers running Solaris 10 with ZFS. Each server has 48TB of capacity.
The Sun Modular Datacenter

"The only thing needed besides [the shipping container] are the network connections, a chilled water supply and electricity," said Dave Douglas, Sun's chief sustainability officer. "Customers using this tend to be people running out of data center space and need something quickly or need a data center in [a] remote area where mobility is key."

The nonprofit Internet Archive, which is based in the Presidio in San Francisco, uses an algorithm that repeats a Web crawl every two months in order to add new Web page images to its database. The algorithm first performs a broad crawl that starts with a few "seed sites," such as Yahoo's directory. After snapping a shot of the home page, it then moves to any referable pages within the site until there are no more pages to capture. If there are any links on those pages, the algorithm automatically opens them and archives that content as well.
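
The crawl described above resembles a breadth-first traversal of the Web's link graph. The short Python sketch below illustrates that general idea only; it is not the Internet Archive's crawler. The function name `crawl`, the `fetch_links` callback and the `TOY_WEB` link table are hypothetical, and a small in-memory dictionary stands in for actually fetching pages.

```python
from collections import deque


def crawl(seeds, fetch_links):
    """Visit every page reachable from the seeds once, breadth-first.

    fetch_links is a hypothetical callback that returns the URLs
    linked from a given page; a real crawler would fetch and parse
    the page here.
    """
    archived = []   # order in which pages were captured
    seen = set()    # URLs already queued, so no page is captured twice
    queue = deque()
    for seed in seeds:
        if seed not in seen:
            seen.add(seed)
            queue.append(seed)
    while queue:
        url = queue.popleft()
        archived.append(url)  # stand-in for saving a snapshot of the page
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return archived


# Toy link graph (assumed data): one seed site that links to a second site.
TOY_WEB = {
    "dir.example/": ["dir.example/a", "dir.example/b"],
    "dir.example/a": ["dir.example/", "other.example/"],
    "dir.example/b": [],
    "other.example/": ["other.example/x"],
    "other.example/x": [],
}

order = crawl(["dir.example/"], lambda url: TOY_WEB.get(url, []))
print(order)
```

In this toy run the seed's home page is captured first, then the pages it links to within the same site, and only then the external site those pages point to. The `seen` set is what keeps the loop from revisiting the home page when an inner page links back to it.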

Previously, a typical Web crawl was supported by 10 or 20 clustered Linux servers, Kahle said. The new crawls are supported by the entire data center, as all 63 Sun Fire servers act as a single machine.

Each server has 48 1TB hard drives.
The Thumper

In addition to Web pages, the Archive also keeps software, books and a moving image collection that has 150,000 items in 100 different subcollections, as well as audio clips -- to the tune of 200,000 items in over 100 collections.

"We see this scale of machine, and the idea of putting machines outdoors is a potential long-term trend for organizations like us," Kahle said.

The Internet Archive also works with about 100 physical libraries around the world whose curators help guide deep Internet crawls. The Internet Archive's massive database is mirrored to the Bibliotheca Alexandrina, the new Library of Alexandria in Egypt, for disaster recovery purposes.

In 2008, the Internet Archive used 800 servers, each with four hard drives.
The Internet Archive's previous data center with 800 servers, each with four hard drives.
