Mining the Deep Web: Search strategies that work

How to become an enlightened searcher

By Lee Ratzan
December 11, 2006 12:00 PM ET

Computerworld - Just because a Web search engine can't find something doesn't mean it isn't there. You may be looking for info in all the wrong places.

The Deep Web is a vast information repository not always indexed by automated search engines but readily accessible to enlightened individuals.

The Shallow Web, also known as the Surface Web or Static Web, is a collection of Web sites indexed by automated search engines. A search engine bot or Web crawler follows URL links, indexes the content and then relays the results back to the search engine's central index, where they are consolidated and made available to user queries. Ideally, the process eventually scours the entire Web, subject to vendor time and storage constraints.
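To make the link-following idea concrete, here is a minimal sketch in Python. It is not how any particular search engine works: the four-page TOY_WEB dictionary, its page names and the crawl function are all invented for illustration, and no network access is involved. The point is only that a crawler starting from a seed page indexes what links lead it to, and nothing else.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href values of <a> tags found in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory "Web": URL -> HTML. Every page here is made up.
TOY_WEB = {
    "/home": '<a href="/news">News</a> <a href="/about">About</a>',
    "/news": '<a href="/home">Home</a>',
    "/about": "<p>No outgoing links here.</p>",
    "/flight-results": "<p>Built on demand; nothing links to it.</p>",
}


def crawl(seed, web):
    """Breadth-first crawl from seed; returns {url: html} for pages reached."""
    index = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        # Skip pages already indexed (this also stops /home <-> /news loops)
        # and links that point nowhere.
        if url in index or url not in web:
            continue
        html = web[url]
        index[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        queue.extend(parser.links)
    return index


indexed = crawl("/home", TOY_WEB)
missed = sorted(set(TOY_WEB) - set(indexed))
print("Indexed:", sorted(indexed))
print("Never reached:", missed)
```

In this toy setup the on-demand page is never reached because no link points to it, which is the situation the rest of this article describes.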

The crux of the process lies in the indexing. A bot does not report what it can't index. This was a minor issue when the early Web consisted primarily of static generic HTML code, but contemporary Web sites now contain multimedia, scripts and other forms of dynamic content.

The Deep Web consists of Web pages that search engines cannot or will not index. The popular term "Invisible Web" is actually a misnomer, because the information is not invisible; it's just not indexed by bots. Depending on whom you ask, the Deep Web is five to 500 times as vast as the Shallow Web, thus making it an immense and extraordinary online resource. Do the math: If major search engines together index only 20% of the Web, then they miss 80% of the content.
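The arithmetic can be sketched in a few lines of Python. The function name surface_share is made up for this example, and the sketch assumes that the indexed portion of the Web is the same thing as the Shallow Web, as defined earlier. If the Deep Web is some multiple of the Shallow Web, the Shallow Web's share of the whole is 1 / (1 + multiple).

```python
def surface_share(ratio):
    """Fraction of the whole Web that is Shallow, if Deep = ratio x Shallow."""
    return 1 / (1 + ratio)


# 4 corresponds to the hypothetical 20% figure; 5 and 500 are the two
# ends of the range of estimates quoted in the text.
for ratio in (4, 5, 500):
    share = surface_share(ratio)
    print(f"Deep Web {ratio}x the Shallow Web: "
          f"{share:.1%} indexed, {1 - share:.1%} missed")
```

Seen this way, the 20% example corresponds to a Deep Web only four times the size of the Shallow Web, which is below even the low end of the quoted range.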

What makes it deep?

Search engines typically do not index the following types of Web sites:

  • Proprietary sites
  • Sites requiring a registration
  • Sites with scripts
  • Dynamic sites
  • Ephemeral sites
  • Sites blocked by local webmasters
  • Sites blocked by search engine policy
  • Sites with special formats
  • Searchable databases

Proprietary sites require a fee. Registration sites require a login or password. A bot can index script code (e.g., Flash, JavaScript), but it can't always ascertain what the script actually does. Some nasty script junkies have been known to trap bots within infinite loops.

Dynamic Web sites are created on demand and have no existence prior to the query and limited existence afterward (e.g., airline schedules).

If you have ever noticed an interesting link on a news site but were unable to find it later in the day, then you have encountered an ephemeral Web site.

Webmasters can request that their sites not be indexed (Robots Exclusion Protocol), and some search engines skip sites based on their own inscrutable corporate policies. Not long ago, search engines could not index PDF files, thus missing an enormous quantity of vendor white papers and technical reports, not to mention government documents. Special formats become less of an issue as index engines become smarter.
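For readers who want to see the Robots Exclusion Protocol in action, the following sketch uses the robotparser module from Python's standard library. The robots.txt rules, the bot name ExampleBot and the example.com URLs are invented for illustration; a real crawler would fetch the file from the site it is about to visit.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents: every bot is asked to stay out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "ExampleBot" and both URLs are assumptions made for this sketch.
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/report.html")
allowed = parser.can_fetch("ExampleBot", "http://example.com/public/index.html")
print("May fetch private report:", blocked)
print("May fetch public page:", allowed)
```

Note that the protocol is a request, not an enforcement mechanism: a well-behaved bot checks the rules and skips the excluded pages, which is how those pages end up in the Deep Web.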


