Data exposure: Using software to redact personal data from public documents
Algorithms, manual intervention are among the options used to clean up online documents
Computerworld - The personal data of millions of U.S. residents may have been exposed by the public posting of official documents, and local governments are increasingly looking for ways to automate the process of cleaning up data being put online.
Among the solutions available is redaction software that allows government agencies to remove sensitive personal data from the online images of public records. The software, which is being used in at least two Florida counties now, works in much the same way antispam software does -- by using algorithms to analyze images for specific phrases or words.
Some vendors use multiple levels of automatic analysis, while others narrow down the number of documents likely to need redaction, then use human intervention to winnow the desired data and train the applications for improved automatic redaction.
"It's a new technology, but a proven technology," said Paul Miller, president of Aptitude Solutions Inc. in Casselberry, Fla. Aptitude Solutions provides its aiRedact software to Broward and Hillsborough counties in Florida, as well as to counties in other states.
The issue of removing sensitive information -- including Social Security numbers, bank account information, driver's license data and personally identifying details -- from public documents is gaining attention in light of concerns from privacy advocates. They have argued that the number of public documents being posted online with sensitive data included could open the door for a wave of identity theft and fraud (see "Data exposure: Counties across the U.S. posting sensitive info online"). To meet that concern, county officials across the nation are increasingly turning to software to remove that data.
Since finding information in scanned images is more complex than simply locating instances of unique words in a text file, the redaction of information can't be done using traditional methods such as word-pattern analysis, according to Aptitude Solutions.
AiRedact automatically indexes and redacts images using algorithms that look for targeted numbers or words, or that seek out related words in context -- adjacent words like "account number" or "Social Security number." Once keywords are found, the software automatically redacts the information, Miller said. The software can also remove personal information from a designated area on a scanned form -- as long as the forms have a standard layout with information in fixed locations.
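The two approaches Miller describes, matching targeted numbers or nearby context words and blanking fixed areas on standard forms, can be illustrated with a rough sketch. This is not aiRedact's actual method, which the article does not detail. It assumes an OCR step has already turned each page into word tokens with pixel bounding boxes, and every name here (`Token`, `find_redactions`, `fixed_zone_redactions`, the phrase list and the three-word window) is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One OCR'd word and its bounding box (hypothetical structure)."""
    text: str
    box: tuple  # (left, top, right, bottom) in pixels


# Assumed formats and phrases; a real deployment would need many more.
SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
CONTEXT_PHRASES = [("account", "number"), ("social", "security", "number")]
HAS_DIGIT = re.compile(r"\d")
WINDOW = 3  # how many words after a phrase to inspect (assumption)


def _normalize(word):
    # Lowercase and drop surrounding punctuation so "Number:" matches "number".
    return word.lower().strip(".,:;#()")


def find_redactions(tokens):
    """Return the boxes to black out on one page of OCR tokens."""
    words = [_normalize(t.text) for t in tokens]
    hits = set()

    # 1. Targeted numbers: anything shaped like a Social Security number.
    for i, tok in enumerate(tokens):
        if SSN_PATTERN.match(tok.text.strip(".,;:")):
            hits.add(i)

    # 2. Words in context: digit-bearing tokens shortly after a key phrase.
    for phrase in CONTEXT_PHRASES:
        n = len(phrase)
        for i in range(len(words) - n + 1):
            if tuple(words[i:i + n]) == phrase:
                for j in range(i + n, min(i + n + WINDOW, len(tokens))):
                    if HAS_DIGIT.search(tokens[j].text):
                        hits.add(j)

    return [tokens[i].box for i in sorted(hits)]


def fixed_zone_redactions(form_zones, form_type):
    """For standard-layout forms, return the preset areas for that form type."""
    return list(form_zones.get(form_type, []))
```

In this sketch, a page reading "Account Number: 00123456 paid on 2006-04-12" would have only the account number flagged, because the date falls outside the three-word window. The window size is an arbitrary choice here and would need tuning against real documents.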
As the application looks for candidates for redaction from among millions of document images, several thousand pages are culled and analyzed individually by a person who can verify that the information should be redacted. As the pool of documents is reviewed, the software automatically adjusts to redact the remaining records based on the choices made manually, Miller said.
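The article does not say how the software adjusts to reviewers' choices. One simple way to picture such a feedback loop, offered only as an assumption, is to track how often a person confirms each detection rule and to automate the rules that reviewers almost always agree or disagree with. The class name, rule labels and thresholds below are all hypothetical.

```python
class ReviewFeedback:
    """Tracks reviewer decisions per detection rule (illustrative only)."""

    def __init__(self, min_reviews=5, auto_threshold=0.9):
        self.min_reviews = min_reviews        # reviews needed before automating
        self.auto_threshold = auto_threshold  # agreement rate needed to automate
        self.confirmed = {}
        self.total = {}

    def record(self, rule, reviewer_confirmed):
        """Store one manual decision for a candidate found by `rule`."""
        self.total[rule] = self.total.get(rule, 0) + 1
        if reviewer_confirmed:
            self.confirmed[rule] = self.confirmed.get(rule, 0) + 1

    def decide(self, rule):
        """Return 'redact', 'skip' or 'review' for a new candidate."""
        n = self.total.get(rule, 0)
        if n < self.min_reviews:
            return "review"  # not enough manual decisions yet
        rate = self.confirmed.get(rule, 0) / n
        if rate >= self.auto_threshold:
            return "redact"  # reviewers nearly always confirmed this rule
        if rate <= 1 - self.auto_threshold:
            return "skip"    # reviewers nearly always rejected this rule
        return "review"      # mixed record, keep a person in the loop
```

Under these assumptions, a rule that a reviewer has confirmed on every one of its first five candidates would be applied automatically to the remaining records, while a rule with a mixed record would keep going to a person.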

