Computerworld

The Search Is On

Research labs are finding smarter ways to sift and analyze huge databases. Here are four of the coolest projects.
 


April 15, 2002 (Computerworld) -- Researchers are inventing better ways to find and make sense of information. Efforts to improve data mining and searching are being driven by the deluge of information in this increasingly networked world and by companies' need to respond ever faster to changes. And, sadly, the field has gotten a big boost from the terrorist attacks on the U.S. last fall.

Computerworld looked at some of this research and found companies perfecting techniques for machine learning, real-time analysis of data flows, distributed data mining and the discovery of "nonobvious" relationships.

Known Associates

Systems Research & Development (SRD) developed its Non-Obvious Relationship Awareness (NORA) technology to help casinos identify cheaters by correlating information from multiple sources about relationships and earlier transactions.

Las Vegas-based SRD, which received funding from the CIA, is now developing several NORA plug-ins to reach further into the world of criminals and terrorists. Last month, the company unveiled a "degrees of separation" capability that finds deeper connections among people.
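A "degrees of separation" search of this kind can be pictured as a shortest-path search over a graph of people and their relationships. The sketch below is an illustration only, using a textbook breadth-first search. It is not SRD's implementation, and the function name `find_chain`, the `max_links` cap and the sample records are all invented for the example.

```python
from collections import deque


def find_chain(links, start, goal, max_links=30):
    """Return the shortest list of (person, relation, person) hops, or None."""
    # Build an undirected adjacency map: person -> [(neighbor, relation label)].
    graph = {}
    for a, relation, b in links:
        graph.setdefault(a, []).append((b, relation))
        graph.setdefault(b, []).append((a, relation))

    queue = deque([(start, [])])
    seen = {start}
    while queue:
        person, path = queue.popleft()
        if person == goal:
            return path
        if len(path) >= max_links:
            continue  # do not extend chains past the cap
        for neighbor, relation in graph.get(person, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [(person, relation, neighbor)]))
    return None


# Hypothetical records merged from several sources.
links = [
    ("applicant", "shares address with", "roommate"),
    ("roommate", "business partner of", "partner"),
    ("partner", "sibling of", "known_cheater"),
]
chain = find_chain(links, "applicant", "known_cheater")
for person, relation, other in chain:
    print(person, relation, other)
```

Breadth-first search is used here because it returns the chain with the fewest links first, which is usually the connection an investigator wants to see.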

"It will tell you that the Drug Enforcement Agency's agent's college roommate's ex-wife's current husband is the drug lord," says Jeff Jonas, chief technology officer at SRD. NORA can bridge up to 30 such links, he says.

Prabhakar Raghavan, CTO at Verity Inc.
The new NORA module uses streaming technology that scans data and extracts information in real time as it flows by. That would allow it to, for example, instantly discover that a man at an airline ticket counter shares a phone number with a known terrorist and then issue an alert before he can board his flight. Jonas calls it "perpetual analytics," to distinguish it from periodic queries against an occasionally updated database.
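The difference between checking each record as it arrives and running periodic queries can be shown with a small sketch. It assumes a watch list indexed by phone number. The class name `WatchlistMonitor`, the record fields and the sample values are hypothetical and are not taken from NORA.

```python
class WatchlistMonitor:
    """Checks each incoming record against a watch list the moment it arrives."""

    def __init__(self, watchlist):
        # Index watch-list entries by phone number for constant-time lookup.
        self.by_phone = {}
        for name, phone in watchlist:
            self.by_phone.setdefault(phone, []).append(name)

    def observe(self, record):
        """Return alert strings for one record; an empty list means no match."""
        matches = self.by_phone.get(record["phone"], [])
        return [
            f'{record["name"]} shares phone {record["phone"]} with {listed}'
            for listed in matches
            if listed != record["name"]
        ]


monitor = WatchlistMonitor([("Known Suspect", "555-0100")])

# Records are handled one at a time as they flow by, not in a nightly batch.
incoming = [
    {"name": "Passenger A", "phone": "555-0199"},
    {"name": "Passenger B", "phone": "555-0100"},
]
for record in incoming:
    for alert in monitor.observe(record):
        print("ALERT:", alert)
```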

SRD is also developing the concept of "cascading" NORA data warehouses for really big problems.

For example, Jonas says, each airline might have a copy of NORA processing its passenger data and sending the summarized results to a midtier NORA system at the Federal Aviation Administration. Car rental agencies might send their NORA results to a rental car association. And the U.S. Immigration and Naturalization Service could collect data from ports of entry.

All three midtier NORA systems would then send transactions to the top-tier system at the Office of Homeland Security in Washington. They would communicate with one another in a "zero administration" arrangement in which rules and filters would determine whether a piece of information got passed up or down the chain, Jonas says.
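One way to picture the cascading arrangement is as a chain of nodes, each with its own rule for deciding what gets passed along. The sketch below is a guess at the general shape, not a description of SRD's design. The `Tier` class, the `risk` field and the thresholds are invented, and only the upward direction is modeled.

```python
class Tier:
    """One node that keeps what it sees and may forward it to its parent."""

    def __init__(self, name, forward_rule, parent=None):
        self.name = name
        self.forward_rule = forward_rule  # function: item -> bool
        self.parent = parent
        self.received = []

    def submit(self, item):
        self.received.append(item)
        # The rule, not an administrator, decides whether the item moves up.
        if self.parent is not None and self.forward_rule(item):
            self.parent.submit(item)


# Hypothetical three-level chain with made-up thresholds.
top = Tier("top tier", forward_rule=lambda item: False)
midtier = Tier("midtier", forward_rule=lambda item: item["risk"] >= 0.8, parent=top)
airline = Tier("airline", forward_rule=lambda item: item["risk"] >= 0.5, parent=midtier)

for n, risk in enumerate((0.3, 0.6, 0.9)):
    airline.submit({"id": n, "risk": risk})

for tier in (airline, midtier, top):
    print(tier.name, "holds", len(tier.received), "items")
```

Each level sees less than the one below it, which is the point of summarizing results before sending them upward.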

Outbreak Detection

If a bioterrorist attack occurred, it would be critical for health and law enforcement officials to find out quickly, even before people were diagnosed with a specific disease.

The key to doing that lies in distributed data mining, says Tom Mitchell, a computer science professor at Carnegie Mellon University in Pittsburgh. Carnegie Mellon and the University of Pittsburgh recently fielded the Real-Time Outbreak and Disease Surveillance (RODS) system, which takes data feeds from the emergency rooms of 17 local hospitals, loads the data into a database and applies statistical techniques to predict the occurrence of diseases such as anthrax and smallpox. The universities also used RODS during this year's Olympic Winter Games in Salt Lake City.

The system considers 30 to 100 variables every few minutes over a large geographic area, says project co-director Andrew Moore, who is also director of the Biomedical Security Institute in Pittsburgh. "We are looking at between 1 million and 1 trillion possible strange things going on—possible indicators of various kinds of disease," says Moore. "If we are not careful, we'll use a year's worth of supercomputer time every day."

Project members are working on better algorithms and have increased processing efficiency by a factor of 10,000 in the past year, Moore says, but more improvements are needed. The system may be expanded to look at pharmacy cash-register data, school attendance records, animal sickness data, phone call records and vehicular traffic patterns, all of which may hold real-time clues about changes in a population's health.
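The article does not say which statistical techniques RODS applies. As a rough illustration of the general idea, the sketch below flags a day on which one count, such as emergency-room visits for a given complaint, sits far above its recent baseline. The function name, the threshold of three standard deviations and the sample counts are assumptions made for the example.

```python
from statistics import mean, stdev


def is_unusual(history, today, threshold=3.0):
    """Flag today's count if it is far above the recent baseline."""
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation; needs 2+ data points
    if spread == 0:
        return today > baseline
    z_score = (today - baseline) / spread
    return z_score > threshold


# Hypothetical daily counts of fever complaints at one hospital.
last_week = [12, 15, 11, 14, 13, 12, 16]

print(is_unusual(last_week, today=15))  # an ordinary day
print(is_unusual(last_week, today=30))  # a sharp jump worth a closer look
```

Running a check like this for every combination of symptom, location and time window is what makes the number of "possible strange things" grow so quickly.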

But gathering all that information raises privacy and confidentiality concerns. At present, the hospital data comes into a central repository where it's carefully scrubbed of information that could be used to identify anyone. Carnegie Mellon researchers are looking at ways to push that scrubbing activity out to the data source.

"How can you design a data mining system that instead of running on a central repository, allows each hospital, store and so on to keep their own records and not reveal the identities?" asks Mitchell, director of Carnegie Mellon's Center for Automated Learning and Discovery. "What you want to do is give them some software that they can use to put their own privacy restrictions on."

That concept could be applied in many domains, Mitchell says. For example, intelligence agencies could use it to allow information-sharing across departments while protecting the sources of the information, he says.
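Scrubbing records at the source, before anything reaches a central repository, might look something like the sketch below. It is a simplified illustration and not the Carnegie Mellon software. The field names and the choice to coarsen age into ten-year bands are assumptions, and real de-identification involves far more than dropping a few fields.

```python
# Hypothetical field names that a hospital chooses never to release.
IDENTIFYING_FIELDS = {"name", "address", "phone", "record_id"}


def scrub(record):
    """Return a copy of the record that is safe to send off-site."""
    clean = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    # Replace exact age with a ten-year band to make re-identification harder.
    if "age" in clean:
        decade = (clean.pop("age") // 10) * 10
        clean["age_band"] = f"{decade}-{decade + 9}"
    return clean


# This runs inside the hospital; only the scrubbed copy leaves the building.
visit = {
    "name": "Jane Doe",
    "phone": "555-0100",
    "age": 47,
    "complaint": "fever",
    "zip3": "152",
}
print(scrub(visit))
```

Because each data owner runs its own copy of the function, it can tighten the rules, for example by adding fields to the blocked set, without asking the central system.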

Upside Down

"Instead of archiving data and running search queries through it, we archive search queries and run data through it," says Val Jerdes, vice president for business development at Streamlogic Inc. "It's a search engine on its head."

The advantage of an inverted search engine, he claims, is that it's 6,000 times more efficient than the conventional approach. It can handle huge volumes of data that would be expensive or impossible to process using the standard method of loading data into an archive, indexing it and then retroactively querying it.

Los Altos Hills, Calif.-based Streamlogic's feed-monitoring technology "strains" the information through query rules in real time, eliminating the archival requirement entirely. A demonstration at www.streamlogic.com runs all the postings to some 50,000 Usenet news groups (10 postings per second, or 2GB per day) through a database of user-specified topics and instantly sends an alert every time one of those topics appears in a post. It also turns unstructured information into data that can be put into a relational database for further analysis.

A feed-processing engine plucks out information based on user-specified topics or keywords. A feed analysis engine uses statistical techniques to analyze, categorize and summarize information for identifying trends, advertisement-targeting and other applications. The engine improves with use as it learns the most relevant words and phrases, says Streamlogic.
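The "search engine on its head" idea can be sketched in a few lines: the queries are stored and indexed, and each new document is run through them once as it arrives. This is a minimal illustration under simplifying assumptions, not Streamlogic's engine. The class name `StandingQueries` is invented, and topics are limited to single keywords.

```python
import re


class StandingQueries:
    """Stores user queries and matches each arriving posting against them."""

    def __init__(self):
        self.subscribers_by_term = {}  # keyword -> set of users watching it

    def register(self, user, term):
        self.subscribers_by_term.setdefault(term.lower(), set()).add(user)

    def match(self, posting):
        """Return the set of (user, keyword) alerts triggered by one posting."""
        words = set(re.findall(r"[a-z0-9]+", posting.lower()))
        alerts = set()
        for word in words:
            for user in self.subscribers_by_term.get(word, ()):
                alerts.add((user, word))
        return alerts


queries = StandingQueries()
queries.register("alice", "anthrax")
queries.register("bob", "Linux")

# Postings are examined as they stream past and are never archived.
for posting in ["New Linux kernel released", "Nothing relevant here"]:
    for user, word in sorted(queries.match(posting)):
        print(f"notify {user}: '{word}' appeared")
```

The cost of handling a posting depends on the number of words in it rather than on the size of any archive, which is the intuition behind the efficiency claim.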

The future of these concepts lies in applications that others will develop with Streamlogic's tool kit, which includes a collection of "metaware" and a language similar to SQL. For example, it could be used to speed and unify the flow of data throughout an enterprise, Jerdes says.

"So when a customer's order comes in, instead of moving from one database to another in functional silos, we are able to dissolve the walls so that the order gets through to manufacturing, customer relations, financial and sales systems," he says. "And all that could happen instantaneously."

What's the Answer?

When someone types the query "What is the population of the world?" into an Internet search engine, he most likely wants the numerical answer—6.2 billion—not pointers to hundreds of documents containing the words population and world. Unfortunately, today's search engines produce more document hits than answers.

But Verity Inc. is developing software that will be a lot smarter, says Prabhakar Raghavan, chief technology officer at the Sunnyvale, Calif.-based company. The approach involves putting human learning, or rules, into the software and enabling that software to teach itself in a process called machine learning.

Suppose you want to build a recruiting system that automatically extracts information from the scanned resumes of job applicants. Raghavan says specific rules could be written into the software to indicate that employment information is commonly found after the words employment, work history and experience, enumerating every possibility. Or one could train the system by giving it an initial batch of resumes annotated as to what information appears in each area of the resume.

"After it's looked at 50 or 100 resumes, it's started to figure out that all those phrases are variants on the same theme," Raghavan says.

Verity is using a relatively new technique called logistic regression classification to enable such machine learning. The best systems for information extraction use both hard-coded rules and machine learning, Raghavan says.
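To show what training on annotated examples means in practice, here is a minimal logistic regression classifier that learns which resume headings introduce employment information. It is a generic textbook version written for illustration, not Verity's software. The training phrases, the function names and the learning settings are all assumptions.

```python
import math


def features(text):
    # Bag of words plus a constant bias feature.
    return set(text.lower().split()) | {"<bias>"}


def probability(weights, text):
    """Estimated probability that the heading introduces employment details."""
    z = sum(weights.get(f, 0.0) for f in features(text))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, epochs=500, rate=0.1):
    """Fit word weights by batch gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        gradient = {}
        for text, label in examples:
            error = label - probability(weights, text)
            for f in features(text):
                gradient[f] = gradient.get(f, 0.0) + error
        for f, g in gradient.items():
            weights[f] = weights.get(f, 0.0) + rate * g
    return weights


# Hypothetical annotated headings: 1 = employment section, 0 = anything else.
annotated = [
    ("employment history", 1),
    ("work history", 1),
    ("professional experience", 1),
    ("work experience", 1),
    ("education", 0),
    ("references available", 0),
    ("contact information", 0),
    ("hobbies and interests", 0),
]
weights = train(annotated)

# A heading the model never saw, built from words it learned to associate
# with employment sections.
print(probability(weights, "employment experience") > 0.5)
```

Nobody wrote a rule saying that "employment experience" marks a work section. The model infers it from the weights it learned for the individual words, which is the behavior Raghavan describes.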

Verity is also working on software that can synthesize and summarize information. "That's difficult with an unstructured query like 'What percentage of Republicans in Santa Clara County are in favor of bombing Iraq?' " Raghavan says. It requires joining data from several sources and resolving conflicting information, and the technology to do that is still primitive, he says.

Raghavan says Verity software will be able to handle a query like the one about world population in about two years, but the capability to answer questions like the one about attacking Iraq will take considerably longer to develop.

Non-Obvious Relationship Awareness (NORA)
Systems Research & Development’s NORA technology can take information from disparate sources about people and their activities and find obscure, nonobvious relationships. For example, it might discover that an applicant for a job at a casino shares a telephone number with a known criminal and issue an alert to the hiring manager.
Source: Systems Research & Development, Las Vegas


