Google sheds light on 'Dark Web' by searching scanned documents
Adds OCR technology so Google engine can index and search scanned PDF documents
Computerworld - Google Inc. this week took another step in its effort to shed light on the so-called Dark Web, announcing that its engine can now search scanned documents in a PDF.
Using optical character recognition (OCR) technology, Google's search engine can now convert scanned PDF documents into text that can be searched and indexed, the company said. Thus, government reports, academic papers and other scanned documents can now show up in search results. Search engines generally interpret scanned PDF documents as images of text rather than as text.
"While we've indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read," Evin Levey, a Google product manager, wrote in a blog post. "To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. In the past, scanned documents were rarely included in search results as we couldn't be sure of their content. We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query."
This is part of an ongoing effort by Google to shed more light on the Deep, or Dark, Web, which holds a massive amount of information that can be accessed but not indexed by a search engine because it sits behind databases or is in a format -- like PDF -- that can't be easily searched. In April, Google announced that it had started experimenting with ways for its search engine to index content behind HTML forms, such as drop-down boxes or select menus, that otherwise couldn't be found or indexed.
Jason Kincaid, a blogger at TechCrunch, noted that searching scanned documents is a "feat that requires an immense amount of processing power" and advanced optical imaging technology.
"In the past, Google would attempt to index these image files as well as possible, but could typically search only file titles and nearby metadata -- not the contents of the documents," he added. "From now on, Google searches will include the text within these scanned images in normal search results. Such technology has existed for quite a while, but accuracy has always been an issue -- and the fact that Google is doing it on such a massive scale makes it a very impressive accomplishment. It also opens the doors to much more thorough searching, especially for content that is often found in printed documents (like academic papers)."