Google aims to penetrate the Deep Web with HTML forms crawling
Wants its search engine to reach pages that HTML forms have kept hidden from its crawler
April 11, 2008 12:00 PM ET
Computerworld - In a move aimed at taking the search-engine giant closer to what's commonly called the "Deep Web," Google Inc. today said that it has started experimenting with ways for its search engine to index content reachable only through HTML forms, including forms that use drop-down boxes and select menus.
Over the past few months, Google has been trying out some HTML forms to see whether it could discover Web pages that couldn't otherwise be found or indexed for users, noted Jayant Madhavan and Alon Halevy, members of Google's crawling and indexing team.
"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes and radio buttons on the form, we choose from among the values of the HTML," they noted in a blog post. "Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting and includes content not in our index, we may include it in our index much as we would include any other Web page."
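The URL-generation step described in the blog post can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Google's implementation: the function name `candidate_urls`, its parameters, the `limit` cap and the example.com address are all hypothetical, and the sketch assumes the form submits with the GET method.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, select_inputs, text_inputs, site_keywords, limit=50):
    """Build one GET URL per combination of chosen input values.

    action:        the form's target URL (hypothetical value in the usage below)
    select_inputs: dict mapping an input name to the option values in the HTML
    text_inputs:   names of free-text boxes on the form
    site_keywords: words taken from the site itself, used to fill text boxes
    limit:         assumed cap on how many URLs to generate
    """
    names = list(select_inputs) + list(text_inputs)
    if not names:
        return []
    # Select menus draw on their own option values; text boxes draw on site words.
    value_lists = [list(select_inputs[name]) for name in select_inputs]
    value_lists += [list(site_keywords) for _ in text_inputs]

    urls = []
    for combo in product(*value_lists):
        query = urlencode(list(zip(names, combo)))
        urls.append(f"{action}?{query}")
        if len(urls) >= limit:
            break
    return urls


urls = candidate_urls(
    "http://example.com/search",
    {"state": ["CA", "NY"]},
    ["q"],
    ["homes", "rentals"],
)
```

Each generated URL would then be fetched and kept only if the resulting page is valid and adds content not already indexed, a check this sketch leaves out.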
If a site includes instructions that tell search engines not to crawl it, Google will adhere to those instructions, it said. In addition, the company will skip any forms that require password input or that use terms commonly associated with personal information, such as log-ins or user IDs.
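The two exclusion rules above might look something like the following sketch. It assumes the site's crawl instructions take the form of a robots.txt file, which the article does not specify; the `SENSITIVE_TERMS` list, both function names and the `examplebot` user agent are hypothetical, chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Illustrative terms only; the article does not give the actual list.
SENSITIVE_TERMS = ("login", "log-in", "userid", "user_id", "username", "password")


def should_skip_form(inputs):
    """Return True if a form asks for a password or personal identifiers.

    inputs: list of (input_type, input_name) pairs taken from the form's HTML.
    """
    for input_type, name in inputs:
        if input_type.lower() == "password":
            return True
        lowered = name.lower()
        if any(term in lowered for term in SENSITIVE_TERMS):
            return True
    return False


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against the text of a site's robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


robots = "User-agent: *\nDisallow: /search\n"
search_form = [("text", "q"), ("select", "state")]
login_form = [("text", "username"), ("password", "pw")]
```

Under these assumptions, `should_skip_form(login_form)` is true while `should_skip_form(search_form)` is false, and a URL under `/search` is rejected by `allowed_by_robots` for any user agent.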
Web pages discovered through the enhanced crawl will not come at the expense of the regular Web pages that are already part of the crawl, so the method won't affect page ranking, Google noted.
"This experiment is part of Google's broader effort to increase its coverage of the Web," Google noted. "In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms … we are able to lead search-engine users to documents that would otherwise not be easily found in search engines, and provide Web masters and users alike with a better and more comprehensive search experience."