Definition: The deep Web, also called the invisible Web, refers to the mass of information that can be accessed via the World Wide Web but can't be indexed by traditional search engines -- often because it's locked up in databases and served up as dynamic pages in response to specific queries or searches.
Most writers these days do a significant part of their research using the World Wide Web, with the help of powerful search engines such as Google and Yahoo. There is so much information available that one could be forgiven for thinking that "everything" is accessible this way, but nothing could be further from the truth. For example, as of August 2005, Google claimed to have indexed 8.2 billion Web pages and 2.1 billion images. That sounds impressive, but it's just the tip of the iceberg. Behold the deep Web.
According to Mike Bergman, chief technology officer at BrightPlanet Corp. in Sioux Falls, S.D., more than 500 times as much information as traditional search engines "know about" is available in the deep Web. This massive store of information is locked up inside databases from which Web pages are generated in response to specific queries. Although each of these dynamic pages has a unique URL with which it can be retrieved again, the pages are not persistent or stored as static pages, nor are there links to them from other pages.
The deep Web also includes sites that require registration or otherwise restrict access to their pages, prohibiting search engines from browsing them and creating cached copies.
Let's recap how conventional search engines create their databases. Programs called spiders or Web crawlers start by reading pages from a starting list of Web sites. A spider reads each page on a site, indexes all of its content and adds the words it finds to the search engine's growing database. When a spider finds a hyperlink to another page, it adds that new link to the list of pages to be indexed. In time, the program reaches all linked pages, presuming that the search engine doesn't run out of time or storage space. These linked pages, reachable from other Web pages or sites, constitute what most of us use and refer to as the Internet or the Web. In fact, we have only scratched the surface, which is why this realm of information is often called the surface Web.
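The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any real search engine is built: the `TOY_WEB` dictionary, the `example` URLs and the `crawl` function are all made up for the example, and the "Web" lives in memory so that nothing is fetched over the network.

```python
from collections import deque

# Hypothetical toy "Web": URL -> (page text, list of outgoing links).
TOY_WEB = {
    "a.example/": ("home page about whales", ["a.example/blue", "b.example/"]),
    "a.example/blue": ("blue whales are large", []),
    "b.example/": ("whale watching tours", ["a.example/"]),
    "c.example/": ("orphan page nobody links to", []),
}


def crawl(seed_urls, web):
    """Breadth-first crawl; returns an index mapping each word to the URLs containing it."""
    index = {}
    queue = deque(seed_urls)   # pages waiting to be read
    seen = set(seed_urls)      # pages already queued, so loops in the link graph end
    while queue:
        url = queue.popleft()
        if url not in web:
            continue           # dead link: nothing to read
        text, links = web[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl(["a.example/"], TOY_WEB)
print(sorted(index["whales"]))  # ['a.example/', 'a.example/blue']
print("orphan" in index)        # False: no link leads to c.example/
```

The page at `c.example/` exists, but because no page in the crawl links to it, none of its words ever reach the index.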
Why don't our search engines find the deeper information? For starters, let's consider a typical data store that an individual or enterprise has collected, containing books, texts, articles, images, laboratory results and various other kinds of data in diverse formats. Typically we access information held in such a database by means of a query or search -- we type in the subject or keyword we're looking for, the database retrieves the appropriate content, and we are shown a page of results for our query.
If we can do this easily, why can't a search engine? We assume that the search engine can reach the query input (or search) page, and it will capture the text on that page and in any pages that may have static hyperlinks to it. But unlike the typical human user, the spider can't know what words it should type into the query field. Clearly, it can't type in every word it knows about, and it doesn't know what's relevant to that particular site or database. If there's no easy way to query, the underlying data remains invisible to the search engine. Indeed, any pages that are not eventually connected by links from pages in a spider's initial list will be invisible and thus are not part of the surface Web as that spider defines it.
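A toy model makes the gap concrete. In the Python sketch below, every name (`DATABASE`, `STATIC_PAGES`, `results_page`, `spider_visible_text`) is hypothetical and the data is invented; the point is only that a program following links collects the text of the search page itself, while the records behind the form appear only when someone supplies the right keyword.

```python
# Hypothetical database behind a search form: keyword -> record text.
DATABASE = {
    "aspirin": "Lab result 17: aspirin solubility data",
    "ibuprofen": "Lab result 42: ibuprofen stability data",
}

# Static pages a spider can reach by following links: URL -> (text, links).
STATIC_PAGES = {
    "/": ("Welcome to the lab archive", ["/search"]),
    "/search": ("Enter a keyword to search the archive", []),
}


def results_page(query):
    """Dynamic page built on demand, as happens when a person submits the form."""
    return DATABASE.get(query.lower(), "No results")


def spider_visible_text(start):
    """Gather text from every page reachable by links alone; the form is never submitted."""
    seen, stack, texts = set(), [start], []
    while stack:
        url = stack.pop()
        if url in seen or url not in STATIC_PAGES:
            continue
        seen.add(url)
        text, links = STATIC_PAGES[url]
        texts.append(text)
        stack.extend(links)
    return " ".join(texts)


surface = spider_visible_text("/")
print("aspirin" in surface)      # False: the spider never saw the record
print(results_page("aspirin"))   # a person who types the keyword gets the record
```

The spider does reach `/search`, so the words on the form page are indexed, but nothing in `DATABASE` is.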
How Deep? How Big?
According to a 2001 BrightPlanet study, the deep Web is very big indeed: The company found that the 60 largest deep Web sources contained 84 billion pages of content with about 750TB of information. These 60 sources constituted a resource 40 times larger than the surface Web. Today, BrightPlanet reckons the deep Web totals 7500TB, with more than 250,000 sites and 500 billion individual documents. And that's just for Web sites in English or European character sets. (For comparison, remember that Google, the largest crawler-based search engine, now indexes some 8 billion pages.) Bergman's company, a vendor of deep Web harvesting software that works mainly with the intelligence community, accesses sites in over 140 languages, many based on non-Latin characters. BrightPlanet routinely ships its products with links to over 70,000 deep Web sources, all translated into English. Bergman says that his customers are probably accessing two to three times that many sources.
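As a rough check on the scale of these figures, the short Python calculation below divides the quoted storage totals by the quoted document counts. It assumes decimal terabytes (1TB = 10^12 bytes), which the article does not specify, and the function name is invented for the example; the averages are back-of-the-envelope values, not numbers reported by BrightPlanet.

```python
BYTES_PER_TB = 10**12  # assumption: decimal terabytes


def avg_bytes_per_doc(terabytes, documents):
    """Average document size in bytes implied by a total size and a document count."""
    return terabytes * BYTES_PER_TB / documents


# 2001 study: 60 largest sources, 84 billion pages, about 750TB.
study_avg = avg_bytes_per_doc(750, 84_000_000_000)

# Later estimate: 7500TB across 500 billion documents.
later_avg = avg_bytes_per_doc(7500, 500_000_000_000)

print(round(study_avg))  # roughly 8,900 bytes per page
print(round(later_avg))  # 15,000 bytes per document
```

Under that assumption, both estimates imply documents averaging on the order of 10KB each.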
The deep Web is getting deeper and bigger all the time. Two factors seem to account for this. First, newer data sources (especially those not in English) tend to be of the dynamic-query/searchable type, which are generally more useful than static pages. Second, governments at all levels around the world have made commitments to making their official documents and records available on the Web. Bergman says he's aware of at least 10 U.S. states that maintain single-access portals to all state documents and public records.
Interestingly, deep Web sites appear to receive 50% more monthly traffic than surface sites do, and they have more sites linked to them, even though they are not really known to the public. They are typically narrower in scope but likely to have deeper, more detailed content. According to Bergman, only about 5% of the deep Web requires fees or subscriptions.
Kay is a Computerworld contributing writer in Worcester, Mass. You can contact him at firstname.lastname@example.org.