The Internet Archive's Wayback Machine gets a new data center
It's housed in a 20-foot-long metal shipping container
Computerworld - The Internet Archive today announced that it has a new computer behind its library of 151 billion archived Web pages. The machine fits in a 20-foot-long outdoor metal cargo container filled with 63 server clusters that offer 4.5 million gigabytes of data storage capacity and 1TB of memory.
The Internet Archive has been taking a snapshot of the World Wide Web every two months since 1997, and the images are made available through the Wayback Machine, a Web site that gets about 200,000 visitors a day, or about 500 hits per second, on the 4.5-petabyte database.
"It may be the single largest database in the world, and it's all in a shipping container. I think of the shipping container as a single machine or expression made up of many smaller machines," said Brewster Kahle, digital librarian and co-founder of the Internet Archive, the nonprofit organization that runs the Wayback Machine site.
For the past 13 years, the Internet Archive has been growing rapidly, most recently by about 100TB of data per month. Until last year, the site had been using a more traditional data center filled with 800 standard Linux servers, each with four hard drives. The new Sun Modular Datacenter that powers it now is on Sun's campus in Santa Clara, Calif., and houses eight racks filled with 63 Sun Fire x4500 servers with dual- or quad-core x86 processors running Solaris 10 with ZFS. Each Sun server is combined with an array of 48 1TB hard drives. The server unit is referred to as a "Thumper."
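As a rough check on the hardware figures in the paragraph above, the drive counts can be multiplied out. This is only a back-of-the-envelope sketch: the variable names are illustrative, and the result is raw drive capacity, which the article does not reconcile with the 4.5-petabyte figure it quotes elsewhere.

```python
# Figures taken from the article; names are illustrative only.
servers = 63            # Sun Fire x4500 "Thumper" units in the container
drives_per_server = 48  # drives attached to each server
tb_per_drive = 1        # capacity of each drive, in terabytes

# Raw capacity before any file-system or redundancy overhead.
raw_tb = servers * drives_per_server * tb_per_drive
print(f"Raw drive capacity: {raw_tb} TB")

# At the stated growth of about 100TB per month, this is how many
# months of new data that raw capacity would hold if it started empty.
growth_tb_per_month = 100
months_of_growth = raw_tb / growth_tb_per_month
print(f"Months of growth at 100TB/month: {months_of_growth:.1f}")
```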
"The only thing needed besides [the shipping container] are the network connections, a chilled water supply and electricity," said Dave Douglas, Sun's chief sustainability officer. "Customers using this tend to be people running out of data center space and need something quickly or need a data center in remote area where mobility is key."
The nonprofit Internet Archive, which is based in the Presidio in San Francisco, uses an algorithm that repeats a Web crawl every two months in order to add new Web page images to its database. The algorithm first performs a broad crawl that starts with a few "seed sites," such as Yahoo's directory. After snapping a shot of the home page, it then moves to any referable pages within the site until there are no more pages to capture. If there are any links on those pages, the algorithm automatically opens them and archives that content as well.
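The crawl described above resembles a breadth-first traversal that starts from seed pages and follows links until nothing new turns up. The sketch below illustrates that general idea only; it is not the Internet Archive's actual crawler. The `crawl` function, the `toy_web` link graph and the example URLs are all hypothetical, and a dictionary stands in for fetching and parsing real pages.

```python
from collections import deque


def crawl(seeds, links):
    """Breadth-first crawl over a toy link graph.

    `links` maps a page to the list of pages it links to. It is a
    stand-in for downloading a page and extracting its links.
    Returns the pages in the order they were captured.
    """
    queue = deque(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    seen = set(queue)
    archived = []
    while queue:
        page = queue.popleft()
        archived.append(page)            # "snap a shot" of this page
        for target in links.get(page, []):
            if target not in seen:       # never capture a page twice
                seen.add(target)
                queue.append(target)
    return archived


# Hypothetical four-page web: one seed site that links out to another site.
toy_web = {
    "seed.example/": ["seed.example/a", "seed.example/b"],
    "seed.example/a": ["other.example/"],
    "seed.example/b": ["seed.example/a"],
    "other.example/": [],
}

captured = crawl(["seed.example/"], toy_web)
print(captured)
```

The `seen` set is what lets the crawl stop: once every reachable page has been queued, no new links are added and the queue drains.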
Previously, a typical Web crawl was supported by 10 or 20 clustered Linux servers, Kahle said. The new crawls are supported by the entire data center, as all 63 Sun Fire servers act as a single machine.
In addition to Web pages, the Archive also keeps software, books and a moving image collection that has 150,000 items in 100 different subcollections, as well as audio clips -- to the tune of 200,000 items in over 100 collections.
"We see this scale of machine, and the idea of putting machines outdoors is a potential long-term trend for organizations like us," Kahle said.
The Internet Archive also works with about 100 physical libraries around the world whose curators help guide deep Internet crawls. The Internet Archive's massive database is mirrored to the Bibliotheca Alexandrina, the new Library of Alexandria in Egypt, for disaster recovery purposes.