
April 15, 2002 (Computerworld) --
Researchers are inventing better ways to find and make sense of information. Efforts to improve data mining and searching are being driven by the deluge of information in this increasingly networked world and by companies' need to respond ever faster to changes. And, sadly, the field has gotten a big boost from the terrorist attacks on the U.S. last fall.
Computerworld looked at some of this research and found companies perfecting techniques for machine learning, real-time analysis of data flows, distributed data mining and the discovery of "nonobvious" relationships.
Known Associates
Systems Research & Development (SRD) developed its Non-Obvious Relationship Awareness (NORA) technology to help casinos identify cheaters by correlating information from multiple sources about relationships and earlier transactions.
Las Vegas-based SRD, which received funding from the CIA, is now developing several NORA plug-ins to reach further into the world of criminals and terrorists. Last month, the company unveiled a "degrees of separation" capability that finds deeper connections among people.
"It will tell you that the Drug Enforcement Agency's agent's college roommate's ex-wife's current husband is the drug lord," says Jeff Jonas, chief technology officer at SRD. NORA can bridge up to 30 such links, he says.

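To make the "degrees of separation" idea concrete, here is a minimal sketch that treats known relationships as an undirected graph and uses a breadth-first search to find the shortest chain between two people. It is an illustration only, not SRD's implementation: the function name `find_chain`, the `max_links` parameter and the sample data are assumptions, with the limit of 30 taken from the figure Jonas cites.

```python
from collections import deque


def find_chain(links, start, target, max_links=30):
    """Return the shortest chain of people from start to target, or None.

    links is a list of (person, person) pairs; max_links caps how many
    relationships the chain may cross (hypothetical parameter).
    """
    # Build an undirected adjacency map from the relationship pairs.
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        if len(path) - 1 >= max_links:
            continue  # chain is already at the link limit; do not extend it
        for neighbor in sorted(graph.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Made-up data mirroring the example in the quote above.
links = [
    ("agent", "college roommate"),
    ("college roommate", "ex-wife"),
    ("ex-wife", "current husband"),
]
chain = find_chain(links, "agent", "current husband")
print(" -> ".join(chain))
```

Breadth-first search visits people in order of distance, so the first chain it finds is also the shortest one.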
SRD is also developing the concept of "cascading" NORA data warehouses for really big problems.
For example, Jonas says, each airline might have a copy of NORA processing its passenger data and sending the summarized results to a midtier NORA system at the Federal Aviation Administration. Car rental agencies might send their NORA results to a rental car association. And the U.S. Immigration and Naturalization Service could collect data from ports of entry.
All three midtier NORA systems would then send transactions to the top-tier system at the Office of Homeland Security in Washington. They would communicate with one another in a "zero administration" arrangement in which rules and filters would determine whether a piece of information got passed up or down the chain, Jonas says.
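The cascading arrangement can be pictured as a chain of tiers, each with a rule that decides whether a record moves up. The sketch below is a rough illustration under stated assumptions: the `Tier` class, the `score` field and the thresholds are hypothetical, and it models only the upward flow, not the downward flow Jonas also mentions.

```python
class Tier:
    """One level in a cascade; forwards a record to its parent if a rule allows."""

    def __init__(self, name, should_forward, parent=None):
        self.name = name
        self.should_forward = should_forward  # rule: record -> bool
        self.parent = parent
        self.received = []

    def submit(self, record):
        # Every tier keeps what it receives, then applies its own filter.
        self.received.append(record)
        if self.parent is not None and self.should_forward(record):
            self.parent.submit(record)


# Hypothetical three-level cascade with made-up thresholds.
top = Tier("top tier", lambda r: False)
midtier = Tier("midtier", lambda r: r["score"] >= 0.8, parent=top)
airline = Tier("airline", lambda r: r["score"] >= 0.5, parent=midtier)

for score in (0.2, 0.6, 0.9):
    airline.submit({"score": score})

print(len(airline.received), len(midtier.received), len(top.received))
```

Because each tier carries its own rule, no central administrator has to decide what is passed along, which is one way to read the "zero administration" description.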
Outbreak Detection
If a bioterrorist attack occurred, it would be critical for health and law enforcement officials to find out quickly, even before people were diagnosed with a specific disease.
The key to doing that lies in distributed data mining, says Tom Mitchell, a computer science professor at Carnegie Mellon University in Pittsburgh. Carnegie Mellon and the University of Pittsburgh recently fielded the Real-time Outbreak and Disease Surveillance (RODS) system, which takes data feeds from the emergency rooms of 17 local hospitals, loads the data into a database and applies statistical techniques to predict the occurrence of diseases such as anthrax and smallpox. The universities also used RODS during this year's Olympic Winter Games in Salt Lake City.
The system considers 30 to 100 variables every few minutes over a large geographic area, says project co-director Andrew Moore, who is also director of the Biomedical Security Institute in Pittsburgh. "We are looking at between 1 million and 1 trillion possible strange things going on, possible indicators of various kinds of disease," says Moore. "If we are not careful, we'll use a year's worth of supercomputer time every day."
Project members are working on better algorithms and have increased processing efficiency by a factor of 10,000 in the past year, Moore says, but more improvements are needed. The system may be expanded to look at pharmacy cash-register data, school attendance records, animal sickness data, phone call records and vehicular traffic patterns, all of which may hold real-time clues about changes in a population's health.
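The article does not say which statistical techniques RODS applies. As one simple illustration of the general idea, the sketch below compares today's count for each symptom category against its recent history and flags any category that sits several standard deviations above its usual level. The function name, the threshold of 3.0 and the sample counts are all assumptions.

```python
from statistics import mean, stdev


def flag_anomalies(history, today, threshold=3.0):
    """Return {category: z-score} for categories unusually high today.

    history maps a category to a list of past daily counts (at least two);
    today maps a category to the current count.
    """
    flagged = {}
    for category, counts in history.items():
        mu = mean(counts)
        sigma = stdev(counts)
        if sigma == 0:
            continue  # no variation in the baseline, so no z-score
        z = (today.get(category, 0) - mu) / sigma
        if z >= threshold:
            flagged[category] = round(z, 2)
    return flagged


# Made-up emergency room counts for two symptom categories.
history = {
    "respiratory": [10, 12, 11, 9, 13, 10, 12],
    "rash": [2, 3, 2, 1, 3, 2, 2],
}
today = {"respiratory": 25, "rash": 2}
flagged = flag_anomalies(history, today)
print(sorted(flagged))
```

Running a check like this over every combination of variable, location and time window is what drives the search space toward the figures Moore describes, which is why more efficient algorithms matter.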
But gathering all that information raises privacy and confidentiality concerns. At present, the hospital data comes into a central repository where it's carefully scrubbed of information that could be used to identify anyone. Carnegie Mellon researchers are looking at ways to push that scrubbing activity out to the data source.
"How can you design a data mining system that instead of running on a central repository, allows each hospital, store and so on to keep their own records and not reveal the identities?" asks Mitchell, director of Carnegie Mellon's Center for Automated Learning and Discovery. "What you want to do is give them some software that they can use to put their own privacy restrictions on."
That concept could be applied in many domains, Mitchell says. For example, intelligence agencies could use it to allow information-sharing across departments while protecting the sources of the information, he says.
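A rough sketch of scrubbing at the source: each hospital strips identifying fields and reports only summary counts, and the central system combines those summaries without ever seeing a patient record. This is an illustration of the idea Mitchell raises, not Carnegie Mellon's design. The field names, function names and sample records are hypothetical.

```python
from collections import Counter

# Hypothetical list of fields that could identify a patient.
IDENTIFYING_FIELDS = {"name", "address", "phone"}


def scrub(record):
    """Drop identifying fields; this runs at the data source."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def local_summary(records):
    """Count chief complaints locally; only these counts leave the site."""
    return Counter(scrub(r)["complaint"] for r in records)


def combine(summaries):
    """Central step: merge per-site counts without access to raw records."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


# Made-up records held by two separate hospitals.
hospital_a = [
    {"name": "Patient 1", "phone": "555-0101", "complaint": "fever"},
    {"name": "Patient 2", "phone": "555-0102", "complaint": "cough"},
]
hospital_b = [
    {"name": "Patient 3", "address": "1 Example St.", "complaint": "fever"},
]
total = combine([local_summary(hospital_a), local_summary(hospital_b)])
print(dict(total))
```

Each site could substitute its own, stricter version of `scrub`, which is the sense in which sources "put their own privacy restrictions on."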
Upside Down
"Instead of archiving data and running search queries through it, we archive search queries and run data through it," says Val Jerdes, vice president for business development at Streamlogic Inc. "It's a search engine on its head."
The advantage of an inverted search engine, he claims, is that it's 6,000 times more efficient than the conventional approach. It can handle huge volumes of data that would be expensive or impossible to process using the standard method of loading data into an archive, indexing it and then retroactively querying it.
Los Altos Hills, Calif.-based Streamlogic's feed-monitoring technology "strains" the information through query rules in real time, eliminating the archival requirement entirely. A demonstration at www.streamlogic.com runs all the postings to some 50,000 Usenet newsgroups (10 postings per second, or 2GB per day) through a database of user-specified topics and instantly sends an alert every time one of those topics appears in a post. It also turns unstructured information into data that can be put into a relational database for further analysis.
A feed-processing engine plucks out information based on user-specified topics or keywords. A feed analysis engine uses statistical techniques to analyze, categorize and summarize information for identifying trends, advertisement-targeting and other applications. The engine improves with use as it learns the most relevant words and phrases, says Streamlogic.
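The inverted arrangement can be sketched in a few lines: the queries are stored and indexed, and each incoming posting is checked against them once and then discarded. This is only an illustration of the concept, not Streamlogic's engine. The `StandingQueries` class and its methods are hypothetical, and it matches single keywords only.

```python
import re


class StandingQueries:
    """Stores keyword queries and runs incoming postings through them."""

    def __init__(self):
        self._by_keyword = {}  # keyword -> set of subscribed users

    def subscribe(self, user, keyword):
        self._by_keyword.setdefault(keyword.lower(), set()).add(user)

    def match(self, posting):
        """Return {user: matched keywords} for one posting; nothing is archived."""
        words = set(re.findall(r"[a-z0-9]+", posting.lower()))
        alerts = {}
        # Look up only the stored keywords that actually occur in the posting.
        for word in words.intersection(self._by_keyword):
            for user in self._by_keyword[word]:
                alerts.setdefault(user, set()).add(word)
        return alerts


queries = StandingQueries()
queries.subscribe("alice", "anthrax")
queries.subscribe("bob", "linux")

alerts = queries.match("New Linux kernel released today")
print(alerts)
```

The work per posting depends on the number of words in the posting rather than on the size of any archive, which is where the efficiency argument comes from.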
The future of these concepts lies in applications that others will develop with Streamlogic's tool kit, which includes a collection of "metaware" and a language similar to SQL. For example, it could be used to speed and unify the flow of data throughout an enterprise, Jerdes says.
"So when a customer's order comes in, instead of moving from one database to another in functional silos, we are able to dissolve the walls so that the order gets through to manufacturing, customer relations, financial and sales systems," he says. "And all that could happen instantaneously."
What's the Answer?
When someone types the query "What is the population of the world?" into an Internet search engine, he most likely wants the numerical answer, 6.2 billion, not pointers to hundreds of documents containing the words population and world. Unfortunately, today's search engines produce more document hits than answers.
But Verity Inc. is developing software that will be a lot smarter, says Prabhakar Raghavan, chief technology officer at the Sunnyvale, Calif.-based company. The approach involves putting human learning, or rules, into the software and enabling that software to teach itself in a process called machine learning.
Suppose you want to build a recruiting system that automatically extracts information from the scanned resumes of job applicants. Raghavan says specific rules could be written into the software to indicate that employment information is commonly found after the words employment, work history and experience, enumerating every possibility. Or one could train the system by giving it an initial batch of resumes annotated as to what information appears in each area of the resume.
"After it's looked at 50 or 100 resumes, it's started to figure out that all those phrases are variants on the same theme," Raghavan says.
Verity is using a relatively new technique called logistic regression classification to enable such machine learning. The best systems for information extraction use both hard-coded rules and machine learning, Raghavan says.
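As a toy illustration of logistic regression classification, the sketch below learns from a handful of annotated resume section headings which words signal an employment section. It is a minimal from-scratch version for explanation only, not Verity's software. The training headings, vocabulary handling, learning rate and epoch count are assumptions.

```python
import math


def featurize(text, vocab):
    """Bag-of-words vector: 1.0 if the vocabulary word appears in the text."""
    words = text.lower().split()
    return [1.0 if w in words else 0.0 for w in vocab]


def predict_proba(x, weights, bias):
    """Logistic function applied to a weighted sum of the features."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, vocab, epochs=500, lr=0.5):
    """Fit weights by stochastic gradient descent on the log loss."""
    weights = [0.0] * len(vocab)
    bias = 0.0
    for _ in range(epochs):
        for text, label in examples:
            x = featurize(text, vocab)
            err = predict_proba(x, weights, bias) - label
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


# Made-up annotations: 1 = heading introduces employment information.
examples = [
    ("employment", 1), ("work history", 1), ("professional experience", 1),
    ("education", 0), ("references", 0), ("contact information", 0),
    ("hobbies", 0),
]
vocab = sorted({w for text, _ in examples for w in text.split()})
weights, bias = train(examples, vocab)

# "work experience" never appears as a heading in the training data.
p_new = predict_proba(featurize("work experience", vocab), weights, bias)
p_edu = predict_proba(featurize("education", vocab), weights, bias)
print(p_new > 0.5, p_edu < 0.5)
```

The model scores an unseen combination such as "work experience" as employment-related because it has learned weights for the individual words, which is the behavior Raghavan describes as recognizing variants on the same theme.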
Verity is also working on software that can synthesize and summarize information. "That's difficult with an unstructured query like 'What percentage of Republicans in Santa Clara County are in favor of bombing Iraq?' " Raghavan says. It requires joining data from several sources and resolving conflicting information, and the technology to do that is still primitive, he says.
Raghavan says Verity software will be able to handle a query like the one about world population in about two years, but the capability to answer questions like the one about attacking Iraq will take considerably longer to develop.
Non-Obvious Relationship Awareness (NORA)
Systems Research & Development's NORA technology can take information from disparate sources about people and their activities and find obscure, nonobvious relationships. For example, it might discover that an applicant for a job at a casino shares a telephone number with a known criminal and issue an alert to the hiring manager.

Source: Systems Research & Development, Las Vegas
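The shared-telephone-number example can be sketched as an index lookup: index the watchlist by each attribute value, then check every applicant against that index. This is an illustration of the idea only, not SRD's method. The function name, field names and sample records are hypothetical.

```python
def shared_attribute_alerts(applicants, watchlist, fields=("phone", "address")):
    """Return (applicant, field, watchlist name) for every shared attribute."""
    # Index watchlist entries by (field, value) so each lookup is one step.
    index = {}
    for person in watchlist:
        for field in fields:
            value = person.get(field)
            if value:
                index.setdefault((field, value), []).append(person["name"])

    alerts = []
    for applicant in applicants:
        for field in fields:
            key = (field, applicant.get(field))
            for match in index.get(key, []):
                alerts.append((applicant["name"], field, match))
    return alerts


# Made-up records for illustration.
watchlist = [{"name": "Known Criminal", "phone": "555-0199"}]
applicants = [
    {"name": "Applicant A", "phone": "555-0199", "address": "1 Example St."},
    {"name": "Applicant B", "phone": "555-0142"},
]
alerts = shared_attribute_alerts(applicants, watchlist)
print(alerts)
```

A production system would also need to normalize values first (for example, phone number formats and spelling variants of addresses) before exact matching like this could be relied on.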