Google aims to penetrate the Deep Web with HTML forms crawling
Wants its search engine to index pages that HTML forms have kept hidden from its crawler
Computerworld - In a move aimed at taking the search-engine giant closer to what's commonly called the "Deep Web," Google Inc. today said that it has started experimenting with ways for its search engine to crawl through HTML forms, including those built from elements such as drop-down boxes and select menus.
Over the past few months, Google has been submitting queries through some HTML forms to see whether that could uncover Web pages that couldn't otherwise be found or indexed for users, noted Jayant Madhavan and Alon Halevy, members of Google's crawling and indexing team.
"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes and radio buttons on the form, we choose from among the values of the HTML," they noted in a blog post. "Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting and includes content not in our index, we may include it in our index much as we would include any other Web page."
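The approach described in the blog post can be illustrated with a rough sketch. This is not Google's implementation: the class `FormParser`, the function `candidate_urls`, the `limit` parameter and the example form are all hypothetical, and the sketch assumes a simple form submitted by GET. It only shows the general idea of pairing select-menu values with words taken from the site to build URLs a user's query might have produced. The later step of judging whether a resulting page is valid and new to the index is not covered.

```python
from html.parser import HTMLParser
from itertools import product
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collects the action URL and candidate values for one GET form (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.choices = {}      # select name -> list of option values found in the HTML
        self.text_inputs = []  # names of free-text boxes
        self._select = None    # name of the <select> currently open, if any

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "form":
            self.action = a.get("action") or ""
        elif tag == "select" and a.get("name"):
            self._select = a["name"]
            self.choices[self._select] = []
        elif tag == "option" and self._select and a.get("value"):
            self.choices[self._select].append(a["value"])
        elif tag == "input" and a.get("type", "text") == "text" and a.get("name"):
            self.text_inputs.append(a["name"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None


def candidate_urls(page_url, form_html, site_words, limit=50):
    """Build up to `limit` query URLs from the form's own values and words from the site."""
    parser = FormParser()
    parser.feed(form_html)

    # Select menus use the values in the HTML; text boxes use words from the site.
    fields = dict(parser.choices)
    for name in parser.text_inputs:
        fields[name] = list(site_words)
    if not fields:
        return []

    names = list(fields)
    base = urljoin(page_url, parser.action)
    urls = []
    for combo in product(*(fields[n] for n in names)):
        urls.append(base + "?" + urlencode(dict(zip(names, combo))))
        if len(urls) >= limit:  # assumed cap so the crawl stays bounded
            break
    return urls


example_form = (
    '<form action="/search">'
    '<select name="state"><option value="CA">California</option>'
    '<option value="NY">New York</option></select>'
    '<input type="text" name="q">'
    "</form>"
)
urls = candidate_urls("http://example.com/cars", example_form, ["sedan", "truck"])
```

Capping the number of generated URLs is a design choice assumed here, since the number of combinations grows quickly as a form gains more inputs.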
If a site includes instructions telling search engines not to crawl it, Google will adhere to them, it said. In addition, the company will skip any forms that require password input or that use terms commonly associated with personal information, such as log-ins or user IDs.
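A crawler that follows these rules might apply two checks before submitting a form, sketched below. The article does not say how Google implements either check, so everything here is an assumption: the function names `is_form_eligible` and `allowed_by_robots`, the `SENSITIVE_TERMS` pattern, the `ExampleBot` user agent, and the choice of robots.txt as the exclusion mechanism. The term matching is deliberately crude and would produce false positives on real sites.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumed list of field-name terms that suggest personal information.
SENSITIVE_TERMS = re.compile(r"log[\s_-]?in|user[\s_-]?(id|name)|passw(or)?d", re.I)


def is_form_eligible(inputs):
    """Return False if any (type, name) input looks like a password or personal-data field."""
    for input_type, name in inputs:
        if (input_type or "").lower() == "password":
            return False
        if SENSITIVE_TERMS.search(name or ""):
            return False
    return True


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a robots.txt file (one possible exclusion mechanism)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /private/\n"
search_form = [("text", "q"), ("select", "state")]
login_form = [("text", "user_id"), ("password", "pw")]

ok_form = is_form_eligible(search_form)
bad_form = is_form_eligible(login_form)
ok_url = allowed_by_robots(robots, "ExampleBot", "http://example.com/search?q=sedan")
bad_url = allowed_by_robots(robots, "ExampleBot", "http://example.com/private/search?q=sedan")
```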
The Web pages discovered using the enhanced crawling method will not come at the expense of the regular Web pages that are already part of the crawl, so this methodology won't affect page ranking, Google noted.
"This experiment is part of Google's broader effort to increase its coverage of the Web," Google noted. "In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms … we are able to lead search-engine users to documents that would otherwise not be easily found in search engines, and provide Web masters and users alike with a better and more comprehensive search experience."