Sidebar: Web harvesting and libraries

Along with the growth in the amount of information available, the Web has undergone a radical shift from being a platform for distributing information among IT workers to a general tool for exchanging communications and data throughout society. This, along with the fact that material on the Web changes daily and may quickly disappear, has triggered an awareness that much information on the Web deserves to be archived for future use, and numerous libraries have undertaken Web harvesting and archiving projects.

The biggest of all such projects is the Wayback Machine. It contains over 30 billion Web pages archived from 1996 onward. It's a terrific tool for looking into companies and organizations that may no longer exist and for seeing snapshots of Web pages long gone, offering a view of cyberspace in times past. The Wayback Machine doesn't have everything, but its scope is remarkable.

The U.S. Library of Congress has a program called Minerva (Mapping the Internet Electronic Resources Virtual Archive) aimed at collecting and preserving primary source materials created in digital formats (a.k.a. "born digital") that don't exist in any physical form. In the pilot program's first two years, the library has sponsored five event-based harvests of Web sites: Election 2000, Election 2002, Sept. 11, Sept. 11 Remembrance and Olympics 2002. The Minerva collection currently includes more than 35,000 Web sites consisting of more than 500 million Web pages.

Copyright © 2004 IDG Communications, Inc.
