
Digging Into Documents

As the need to exploit unstructured data grows, text mining technology is evolving to meet it, says ClearForest's Ronen Feldman.

By Tommy Peterson
December 22, 2003 12:00 PM ET

Computerworld - Formerly used primarily by the intelligence community and businesses that are strongly dependent upon research, text mining technologies are now beginning to find more general acceptance. The mounds of unstructured data that have been piling up in companies for decades are growing larger as a result of new regulatory requirements that are forcing companies to retain e-mail and other documents - and to be able to find specific information in them. The key to making text mining work for business -- not to mention the intelligence community -- is striking a balance between accuracy and speed, says Ronen Feldman, chief scientist at text mining software company ClearForest Corp. in New York. In a recent interview with Computerworld's Tommy Peterson, Feldman discussed how text mining technologies work and what promise they hold for business.

What is text mining? How do you squeeze information out of unstructured data? Basically, text mining is the same thing as data mining for structured data, but for documents. So the first thing that you have to do is create some structure. In order to create structure, you actually have several possibilities. The easiest way is to work with the bag-of-words model. Basically, each document is just a collection of words. That is purely statistical -- you're doing no semantic analysis. There are still some companies who are doing this. They basically use the simplest possible approach.
The next level is categorization. You basically provide tags for the whole document. The last way to structure the documents, which is the most sophisticated, is to do information extraction. There you don't provide tags for entire documents, but you actually extract entities and relationships from the document. But that means that the processing is much more sophisticated and obviously takes more time.
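
As an illustration of the first two levels Feldman describes, here is a minimal Python sketch, not ClearForest's implementation. The `bag_of_words` and `categorize` helpers and the `CATEGORY_KEYWORDS` rules are hypothetical names invented for this example. The bag-of-words step only counts words, with no semantic analysis; the categorization step attaches a few tags to the whole document.

```python
import re
from collections import Counter

# Hypothetical keyword rules, invented for this sketch only.
CATEGORY_KEYWORDS = {
    "pharma": {"drug", "gene"},
    "text-mining": {"mining", "documents"},
}

def bag_of_words(text):
    """Treat a document as an unordered collection of word counts."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

def categorize(bag):
    """Tag the whole document with every category whose keywords appear."""
    return sorted(
        tag
        for tag, keywords in CATEGORY_KEYWORDS.items()
        if any(word in bag for word in keywords)
    )

doc = "Text mining is data mining for documents. Documents are text."
bag = bag_of_words(doc)
tags = categorize(bag)
print(bag.most_common(3))
print(tags)
```

Both steps describe the document only as a whole: word order and the relationships between the things mentioned are discarded, which is the limitation information extraction is meant to address.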

Do companies have to choose speed vs. sophistication? This is the spectrum -- [bag-of-words] is the easiest and of course the fastest, but it doesn't buy you a lot of mileage, because there is no semantic analysis. With [categorization], you have a little more, but it's still not a good enough infrastructure, because you won't have enough tags per document -- usually two or three tags per document.
Let's take a document of two pages. If you do information extraction, you can expect 50 to 100 tags, a completely different order of magnitude. Clearly, you get a much better foundation for text mining. Information extraction is the key challenge, and it's what really lies at the heart of our ClearText product.

Ronen Feldman, president and chief scientist at ClearForest Corp.
Tell me more about information extraction. There are two main camps in how to do information extraction. The first camp is the knowledge engineering camp, where structurally derived patterns help you to identify that specific noun phrases should belong to a certain class. The classes would depend on the domain in which that document is living. If we're talking about the intelligence domain, then the classes of entities you'd be interested in would be people, organizations, weapons, things like this. Relationships would be ... family relationships, people who served together in the army, two people who talked on the phone. In order to develop those entities and relationships in the knowledge engineering approach, you have to define patterns for each entity and for each relationship. You do it usually if you have a very good development environment, and [ClearForest] has had such an environment for six years, which we continue to enhance and add more features to all the time.
The second camp is based on machine-learning algorithms. In machine learning, you basically learn by example. There are rules, but the rules are written automatically, so it's mainly statistical. The problem is that you need to provide thousands of examples sometimes, meaning thousands of documents. Thousands of documents can take you several months. We saw in a practical project [that] customers are just not willing to do it and in many cases just killed this approach completely. They prefer to use our approach because then they can rely on generic concepts that we have developed already. It's not as though we start from scratch -- we have already developed most of the domain-specific entities.
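
The knowledge engineering approach described above can be sketched, under heavy simplification, as a set of hand-written patterns, one per entity class and one per relationship. The Python below is an assumption-laden illustration, not ClearForest's rule language: the `PERSON` pattern, the `ENTITY_PATTERNS` and `RELATION_PATTERNS` tables, and the `extract` function are all hypothetical, and a real system would need far richer linguistic patterns than regular expressions.

```python
import re

# Hypothetical hand-written patterns; real rule sets are far larger.
PERSON = r"(?:Mr\.|Ms\.|Dr\.) [A-Z][a-z]+"

ENTITY_PATTERNS = {
    "Person": re.compile(PERSON),
    "Organization": re.compile(r"[A-Z][A-Za-z]+ (?:Corp|Inc|Ltd)\."),
}

RELATION_PATTERNS = {
    # Two people who talked on the phone.
    "PhoneCall": re.compile(rf"({PERSON}) (?:called|phoned) ({PERSON})"),
}

def extract(text):
    """Return (entities, relations) found by the hand-written patterns."""
    entities = [
        (label, match.group(0))
        for label, pattern in ENTITY_PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    relations = [
        (label, match.group(1), match.group(2))
        for label, pattern in RELATION_PATTERNS.items()
        for match in pattern.finditer(text)
    ]
    return entities, relations

text = "Dr. Smith called Mr. Jones about the ClearForest Corp. contract."
entities, relations = extract(text)
print(entities)
print(relations)
```

Unlike document-level tags, each match here is tied to a specific phrase in the text, which is why a two-page document can yield far more tags this way. A machine-learning system would instead induce comparable rules from annotated example documents, which is where the labeling cost Feldman mentions comes in.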

Does this end up being knowledge management? Knowledge management is a very broad term; people have used so many different tools to do knowledge management. We are at the infrastructure level, so most of the knowledge management tools should use our software.

Do you worry about legal issues? When corporate e-mails are mined, will employees feel that their privacy is being invaded? We provide the tools; the usage is up to the customer. They need to worry if they are doing something which is illegal. We create generic technology and sell it to customers. They have to live up to traditional promises not to snoop around their employees too much. The only area I can see it used is in compliance, and that should be legal for companies to check that their employees are not doing anything they shouldn't.

Can you give me a good idea of what this technology is going to bring to a specific business? Let's take a pharmaceutical company. The researchers need to read a lot of papers in order to make inferences and get more acquainted with the subject. And usually they spend a lot of time with [the] Medline [Web site] and [scientific] journals like that. And they spend a lot of time just searching. With an application like ours, they can take entities they are familiar with -- genes, etc. -- and specify the queries in a much more focused way. And that means that they focus a lot more on the real development. The hard labor of searching for the information will be saved, which will shorten the time that they take to find new drugs.

How will this technology change the way companies do research?
I think that most of the hard labor will be saved, and you will be able to focus on thinking and making inferences and conclusions -- things machines are actually not so good at.

