Data exposure: Using software to redact personal data from public documents

Algorithms, manual intervention are among the options used to clean up online documents

The personal data of millions of U.S. residents may have been exposed by the public posting of official documents, and local governments are increasingly looking for ways to automate the process of cleaning up data being put online.

Among the solutions available is redaction software that allows government agencies to remove sensitive personal data from the online images of public records. The software, which is being used in at least two Florida counties now, works in much the same way antispam software does -- by using algorithms to analyze images for specific phrases or words.

Some vendors use multiple levels of automatic analysis, while others narrow down the number of documents likely to need redaction, then use human intervention to winnow the desired data and train the applications for improved automatic redaction.

“It’s a new technology, but a proven technology,” said Paul Miller, president of Aptitude Solutions Inc. in Casselberry, Fla. Aptitude Solutions provides its aiRedact software to Broward and Hillsborough counties in Florida, as well as to counties in other states.

The issue of removing sensitive information -- including Social Security numbers, bank account information, driver’s license data and personally identifying details -- from public documents is gaining attention in light of concerns from privacy advocates. They have argued that the number of public documents being posted online with sensitive data included could open the door for a wave of identity theft and fraud (see “Data exposure: Counties across the U.S. posting sensitive info online”). To meet that concern, county officials across the nation are turning increasingly to software to remove that data.

Since finding information in scanned images is more complex than simply locating instances of unique words in a text file, the redaction of information can’t be done using traditional methods such as word-pattern analysis, according to Aptitude Solutions.

AiRedact automatically indexes and redacts images using algorithms that look for targeted numbers or words, or that seek out related words in context -- adjacent words like “account number” or “Social Security number.” Once keywords are found, the software automatically redacts the information, Miller said. The software can also remove personal information from a designated area on a scanned form -- as long as the forms have a standard layout with information in fixed locations.
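The general idea of pattern-plus-context matching can be illustrated with a short Python sketch. It is not aiRedact's actual algorithm, which the vendor does not detail; it assumes the scanned image has already been converted to text by an OCR step, and the names `SSN_PATTERN`, `CONTEXT_PATTERN` and `redact_text` are hypothetical.

```python
import re

# Assumption: text has already been extracted from the scanned image by OCR.
# These patterns are illustrative only, not any vendor's real rules.

# A number formatted like a Social Security number (e.g. 123-45-6789).
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# A number that directly follows a context phrase such as "account number".
CONTEXT_PATTERN = re.compile(
    r"\b(account number|social security number)\b(\W{0,5})([\d-]{4,})",
    re.IGNORECASE,
)


def mask_digits(value: str, mask: str = "X") -> str:
    """Replace every digit in value with the mask character."""
    return re.sub(r"\d", mask, value)


def redact_text(text: str) -> str:
    """Mask SSN-shaped numbers, then numbers that follow a context phrase."""
    # Pass 1: numbers that look sensitive on their own.
    text = SSN_PATTERN.sub(lambda m: mask_digits(m.group(0)), text)
    # Pass 2: numbers that are sensitive only because of the adjacent words.
    text = CONTEXT_PATTERN.sub(
        lambda m: m.group(1) + m.group(2) + mask_digits(m.group(3)), text
    )
    return text


if __name__ == "__main__":
    print(redact_text("Account Number: 00123456, SSN 123-45-6789, page 12 of 40"))
```

In this sketch, "page 12 of 40" is left alone because the numbers neither match the Social Security format nor follow a context phrase. A production system would also have to map each match back to pixel coordinates on the image, which is omitted here.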

As the application looks for candidates for redaction from among millions of document images, several thousand pages are culled and analyzed individually by a person who can verify that the information should be redacted. As the pool of documents is reviewed, the software automatically adjusts to redact the remaining records based on the choices made manually, Miller said.

The amount of time needed to redact the records depends on the hardware used and the number of records that must be checked, he said, but a typical review process can take two to three months. The software typically costs $200,000 to $300,000, depending on the size of the county, he said.

Although the software won’t pick up 100% of the data that needs to be redacted, it does do the vast majority of the work, according to Miller. “It certainly is a challenge,” he said.

In Florida, counties are required to have all online public records redacted for sensitive personal information by Jan. 1, 2007, under a recently enacted state law. All newly posted public records in the state will have to be redacted automatically after that date. The deadline is being challenged by some county clerks in the state, but remains intact.

Other states are looking at similar record-keeping issues, Miller said. “There’s a move afoot in Colorado a little bit and in Pennsylvania a little bit, but no one has this hard-stop deadline like in Florida as far as I’m aware of,” he said.

Another vendor, ImageTech Systems Inc. in Camp Hill, Pa., has built a plug-in redaction module for the widely used Kofax Ascent Capture application from Kofax Image Products Inc. of Irvine, Calif. R.J. Oommen, principal of ImageTech, said his company so far has no customers in Florida but is beginning to target that state’s local governments to offer the redaction module.

The module uses several methods to analyze online scanned images, including user input on the fly, automatic processing of data in standard forms and an intelligence algorithm that uses “confidence thresholds” and verification modules to automate the process with very little human interaction, Oommen said. The module starts at $5,000, but that price can exceed $100,000 depending on the project, he said.
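How “confidence thresholds” might limit human involvement can be sketched in Python. This is an illustration under stated assumptions, not ImageTech's implementation: the `Candidate` class, the `route` function and the threshold values 0.9 and 0.5 are all hypothetical, and the confidence scores are assumed to come from an upstream recognition step.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A possible redaction found on a page (hypothetical structure)."""
    page_id: str
    text: str
    confidence: float  # 0.0 to 1.0, assumed to come from the recognizer


def route(candidates, auto_threshold=0.9, review_threshold=0.5):
    """Split candidates into auto-redact, human-review and ignored groups.

    The threshold values are illustrative assumptions, not vendor defaults.
    """
    auto, review, ignored = [], [], []
    for candidate in candidates:
        if candidate.confidence >= auto_threshold:
            auto.append(candidate)      # confident enough to redact unattended
        elif candidate.confidence >= review_threshold:
            review.append(candidate)    # uncertain: send to a human verifier
        else:
            ignored.append(candidate)   # unlikely to be sensitive data
    return auto, review, ignored


if __name__ == "__main__":
    found = [
        Candidate("deed-001", "123-45-6789", 0.97),
        Candidate("deed-002", "Acct 5521", 0.62),
        Candidate("deed-003", "Lot 14", 0.10),
    ]
    auto, review, ignored = route(found)
    print(len(auto), len(review), len(ignored))
```

Raising the lower threshold or lowering the upper one shrinks the human-review queue, at the cost of more missed or mistaken redactions.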

Other redaction vendors include SRS Technologies, which offers its Document Detective software; Appligent Inc., which offers a Redax redaction module for use with PDF files; and Image Architects Inc., which offers a redaction template creation plug-in for Kofax Ascent Capture and IBM Content Manager applications.

Barbara Petersen, president of the Tallahassee, Fla.-based First Amendment Foundation, a nonprofit media group that supports open government, said the redaction of private information from public records is a right of the state’s residents. Although the Florida Supreme Court is now reviewing how private information should be disclosed in court records, Petersen said, redacting data from online documents shouldn’t be an issue among county clerks -- because doing so doesn’t modify the original document, which is still on file unaltered.

“They claim they’re all about privacy, but they’re not,” she said. “We’re not asking them to modify [documents]. We’re asking them to redact information that is protected from public disclosure.”

Copyright © 2006 IDG Communications, Inc.
