
Google aims to penetrate the Deep Web with HTML forms crawling

Wants search engine to index pages that HTML forms have kept hidden from its crawler

April 11, 2008 12:00 PM ET



Computerworld - In a move aimed at taking the search-engine giant closer to what's commonly called the "Deep Web," Google Inc. today said that it has started experimenting with ways for its search engine to index content that sits behind HTML forms, such as those that use drop-down boxes and select menus.

Over the past few months, Google has been submitting queries to some HTML forms to see whether it could discover Web pages that couldn't otherwise be found or indexed for users, noted Jayant Madhavan and Alon Halevy, members of Google's crawling and indexing team.

"For text boxes, our computers automatically choose words from the site that has the form; for select menus, check boxes and radio buttons on the form, we choose from among the values of the HTML," they noted in a blog post. "Having chosen the values for each input, we generate and then try to crawl URLs that correspond to a possible query a user may have made. If we ascertain that the Web page resulting from our query is valid, interesting and includes content not in our index, we may include it in our index much as we would include any other Web page."
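The URL-generation step described in the blog post can be illustrated with a minimal sketch. This is not Google's implementation; the function name `candidate_urls`, the `limit` parameter, and the example form are hypothetical, and the sketch assumes a form submitted with GET, so that each combination of input values maps to one crawlable URL.

```python
from itertools import product
from urllib.parse import urlencode


def candidate_urls(action, inputs, limit=50):
    """Build GET URLs for combinations of form input values.

    action: the form's action URL (hypothetical in the example below).
    inputs: dict mapping each input name to its candidate values, e.g.
            the options of a select menu, or words taken from the site
            for a text box.
    limit:  cap on the number of URLs, to avoid a combinatorial blow-up.
    """
    names = sorted(inputs)  # fixed order keeps the output deterministic
    urls = []
    for combo in product(*(inputs[name] for name in names)):
        query = urlencode(list(zip(names, combo)))
        urls.append(action + "?" + query)
        if len(urls) >= limit:
            break
    return urls


if __name__ == "__main__":
    # Hypothetical form: one select menu ("state") and one text box ("q").
    form_inputs = {"state": ["CA", "NY"], "q": ["cars"]}
    for url in candidate_urls("http://example.com/search", form_inputs):
        print(url)
```

A crawler would then fetch each generated URL and keep only those pages it judges valid, interesting and not already indexed, as the post describes; that filtering step is left out of the sketch.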

If a site includes instructions that tell search engines not to crawl it, Google will adhere to them, the company said. In addition, Google will skip any forms that require password input or that use terms commonly associated with personal information, such as log-ins or user IDs.
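The form-skipping rule can be sketched in the same spirit. The exact terms Google looks for are not given in the article, so the pattern below is an assumed, illustrative list, and `should_skip_form` is a hypothetical helper rather than anything Google has published.

```python
import re

# Assumed list of terms associated with personal information;
# the article names only log-ins and user IDs as examples.
SENSITIVE = re.compile(
    r"log-?in|logon|user[\s_-]?id|username|password|passwd",
    re.IGNORECASE,
)


def should_skip_form(fields):
    """Return True if a form looks like it asks for personal information.

    fields: list of (input_type, input_name) tuples taken from the form.
    """
    for input_type, name in fields:
        if input_type.lower() == "password":
            return True  # the form requires password input
        if SENSITIVE.search(name):
            return True  # the input name matches a sensitive term
    return False


if __name__ == "__main__":
    print(should_skip_form([("text", "q"), ("select", "state")]))
    print(should_skip_form([("text", "user_id"), ("password", "pw")]))
```

A check like this would run before any URLs are generated, so that forms of this kind are never submitted at all.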

The Web pages discovered using the enhanced crawling method will not come at the expense of the regular Web pages that are already part of the crawl, so this methodology won't affect page ranking, Google noted.

"This experiment is part of Google's broader effort to increase its coverage of the Web," Google noted. "In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms … we are able to lead search-engine users to documents that would otherwise not be easily found in search engines, and provide Web masters and users alike with a better and more comprehensive search experience."


