Optical Character Recognition


Advances are being made to recognize characters based on the context of the word in which they appear, as with the Predictive Optical Word Recognition algorithm from Peabody, Mass.-based ScanSoft Inc. The next step for developers is document recognition, in which the software will use knowledge of the parts of speech and grammar to recognize individual characters.
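To make the idea of word-level context concrete, here is a minimal sketch in Python. It is not ScanSoft's algorithm, whose details are not described here. The tiny word list, the confidence scores and the function name best_word are all assumptions made for illustration: a character classifier proposes several candidates per position, and a spelling found in the word list wins over a higher-scoring spelling that is not a word.

```python
from itertools import product

# Hypothetical mini word list; a real system would use a full dictionary.
LEXICON = {"clear", "modern", "scanner"}

def best_word(candidates, lexicon=LEXICON):
    """Pick the most plausible spelling for one word.

    candidates holds one list per character position; each list contains
    (character, confidence) pairs proposed by a character classifier.
    A spelling found in the lexicon always beats one that is not; ties
    are broken by the product of the per-character confidences.
    """
    best, best_key = None, (False, -1.0)
    for combo in product(*candidates):
        word = "".join(ch for ch, _ in combo)
        score = 1.0
        for _, conf in combo:
            score *= conf
        key = (word in lexicon, score)
        if key > best_key:
            best, best_key = word, key
    return best

# The classifier slightly prefers the digit "1" over the letter "l".
scanned = [
    [("c", 0.9)],
    [("1", 0.6), ("l", 0.4)],
    [("e", 0.9)],
    [("a", 0.9)],
    [("r", 0.9)],
]
print(best_word(scanned))                  # word-level context applied
print(best_word(scanned, lexicon=set()))   # character scores only
```

With the word list in play, the lower-scoring "l" is chosen because it completes a known word. Without it, the digit wins on raw confidence.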

Today, OCR software can recognize a wide variety of fonts, but handwriting and script fonts that mimic handwriting are still problematic.

Developers are taking different approaches to improve script and handwriting recognition. OCR software from ExperVision Inc. in Fremont, Calif., first identifies the font and then runs its character-recognition algorithms.

Advances have made OCR more reliable; expect a minimum of 90% accuracy for average-quality documents. Despite vendor claims of one-button scanning, achieving 99% or greater accuracy takes clean copy, practice in setting scanner parameters and time spent "training" the OCR software on your own documents.
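A quick calculation shows what those percentages mean in practice. The sketch below assumes a page of 2,000 characters, a figure chosen only for illustration, and treats accuracy as the share of characters recognized correctly.

```python
def expected_errors(chars_on_page, accuracy):
    """Estimate how many characters will be misrecognized on one page."""
    return round(chars_on_page * (1.0 - accuracy))

PAGE_CHARS = 2000  # assumed page length, for illustration only

for accuracy in (0.90, 0.99):
    errors = expected_errors(PAGE_CHARS, accuracy)
    print(f"{accuracy:.0%} accuracy: about {errors} errors per page")
```

On that assumption, 90% accuracy leaves roughly 200 characters to correct by hand on every page, while 99% leaves about 20.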

The first step toward better recognition begins with the scanner. The quality of its charge-coupled device (CCD) sensor arrays will affect OCR results. The more tightly packed the sensor elements, the finer the image the scanner can capture; the number of distinct colors it can detect depends on its bit depth.

Smudges or background color can fool the recognition software. Adjusting the scan's resolution can help refine the image and improve the recognition rate, but there are trade-offs.

For example, in an image scanned in 24-bit color at 1,200 dots per inch (dpi), each of the 1,200 pixels along every linear inch carries 24 bits of color information. This scan will take longer than a lower-resolution scan and produce a larger file, but OCR accuracy will likely be high.

A scan at 72 dpi will be faster and produce a smaller file—good for posting an image of the text to the Web—but the lower resolution will likely degrade OCR accuracy.
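The size side of that trade-off can be estimated with simple arithmetic. The sketch below assumes an 8.5-by-11-inch page and an uncompressed image; real files are usually compressed, so treat the results as upper bounds. The function name raw_scan_bytes is hypothetical.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed image size in bytes for a page scanned at a given dpi."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

# Assumed page size: U.S. letter, 8.5 x 11 inches.
for dpi in (1200, 72):
    size = raw_scan_bytes(8.5, 11, dpi, 24)
    print(f"{dpi} dpi, 24-bit color: {size / 1_000_000:.1f} MB uncompressed")
```

Under these assumptions the 1,200-dpi scan holds about 404MB of raw pixel data, compared with about 1.5MB at 72 dpi.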

Most scanners are optimized for 300 dpi, but scanning at a higher number of dots per inch will increase accuracy for type under 6 points in size.
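The reason small type benefits from a higher setting becomes clear when point sizes are converted to pixels. A point is 1/72 of an inch, so the sketch below multiplies the point size by the resolution and divides by 72. The function name is hypothetical, and the result is the full height of the type body, which is somewhat taller than any single letter.

```python
def type_height_px(point_size, dpi):
    """Height of the type body in pixels (1 point = 1/72 inch)."""
    return point_size * dpi / 72

for dpi in (72, 300, 600):
    print(f"6-point type at {dpi} dpi: {type_height_px(6, dpi):.0f} pixels tall")
```

At 300 dpi, 6-point type spans about 25 pixels from top to bottom. Doubling the resolution to 600 dpi doubles that to 50, giving the recognition software more detail to work with.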

Bilevel (black and white only) scans are the rule for text documents. Bilevel scans are faster and produce smaller files because, unlike 24-bit color scans, they require only one bit per pixel. Some scanners also let you adjust how finely they distinguish between colors.

Whether a bilevel or a color scan will be more effective depends on the image being scanned. A bilevel scan of a shopworn page may yield more legible text. But if the image to be scanned has text in a range of colors, as in a brochure, text in lighter colors may drop out.
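The drop-out effect can be illustrated with a simple brightness threshold, one common way to reduce an image to one bit per pixel. The sketch below is an assumption about how such a conversion might work, not a description of any particular scanner. The pixel values and the function name to_bilevel are made up for the example.

```python
def to_bilevel(gray_rows, threshold=128):
    """Convert 8-bit grayscale pixels (0 = black, 255 = white) to 1 bit each.

    Pixels darker than the threshold become 1 (ink); all others become 0 (paper).
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray_rows]

# One row of hypothetical pixels: dark text, light-colored text, white paper.
row = [[30, 200, 250]]

print(to_bilevel(row))                 # default threshold
print(to_bilevel(row, threshold=225))  # higher threshold
```

At the default threshold the light-colored text (value 200) is treated as paper and disappears. Raising the threshold to 225 keeps it, at the risk of turning smudges and background tint into ink as well.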

Lais is a freelance writer in Takoma Park, Md.


The Many Facets of OCR


Determining what text is in an image can be a difficult task. Consider the process below, which is used in a language-independent OCR system described by researchers at BBN Technologies Inc. and GTE Internetworking (now Genuity Inc.). The top half of the diagram shows elements used in setting up and training the system and in using scanned data, as well as rules specific to the language and its orthography (the alphabet or other symbols).

[Diagram: The Many Facets of OCR]


Copyright © 2002 IDG Communications, Inc.
