April 05, 2004 -- Most information junkies would be hard-pressed to name anything that has transformed their professional lives as much as Internet search engines have. The miraculous devices can take your hot topic of the day, scan millions of Web pages and in seconds bring back product announcements, research papers, the names of experts and more, things that would be difficult or impossible to find otherwise.
But as powerful as they are, search engines have huge weaknesses. For example, a recent Google search on the word Linux took just 0.4 seconds, but it had 95 million hits. Too bad if the one you need is No. 10,000 on the list.
Researchers, however, are poised to revolutionize search technology over the next few years. The most common thrust is to personalize search engines so that they know, for example, that if you're an IT professional and you search for mouse, you're more likely to want information about PC devices than about animals.
Adele Howe, a computer science professor at Colorado State University in Fort Collins, and Gabriel Somlo, a CSU graduate student, have built a proof of concept called QueryTracker, a software agent that sits between a user and a conventional search engine and looks for information of recurring interest, such as the latest news about a user's chronic illness. QueryTracker submits a user's query to the search engine once a day and returns results from new Web pages and pages that have changed since the previous search.
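The "new pages and pages that have changed" step can be illustrated with a minimal sketch. It assumes one simple approach, comparing a content hash of each result against the hash stored on the previous run; the article does not say how QueryTracker actually detects changes, and the function names here are hypothetical.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a stable digest of a page's text (assumed change detector)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def new_or_changed(results: dict[str, str], seen: dict[str, str]) -> list[str]:
    """Return URLs that are new or whose content differs from the last run.

    results: URL -> page text returned by today's search (hypothetical input).
    seen:    URL -> fingerprint recorded on earlier runs; updated in place.
    """
    fresh = []
    for url, text in results.items():
        digest = fingerprint(text)
        if seen.get(url) != digest:  # never seen, or content has changed
            fresh.append(url)
        seen[url] = digest
    return fresh
```

Calling `new_or_changed` once a day with the same `seen` dictionary reports each page only on the day it first appears or changes.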
The magic in QueryTracker comes from its automatic generation of an additional daily query, which Howe says is often superior to the user's original query, based on what it learns about the user's interests and priorities over time. It filters the results of both queries for relevance and sends them to the user.
QueryTracker's ability to generate its own searches can compensate for the poorly formed queries that many users write, Howe says. "Even people knowledgeable about the Web are often either lazy or they are just not informed about how to write good queries," she says. The most common mistake: queries that are too short, like the one-word Linux search.
Jeannette Jenssen, a mathematics professor at Dalhousie University in Halifax, Nova Scotia, is taking search personalization techniques a step further, to the "crawlers" that index Web content before it can be searched. She says the popular search engines have three drawbacks: They are increasingly charging corporate users for their services, they skew results in favor of advertisers, and they often retrieve huge amounts of irrelevant information. But Jenssen's "focused crawler" indexes only pages related to prespecified topics and then tailors the rankings to the interests of the user.
For example, she says, a medical society might run the crawler nightly to index just pages relating to medicine. And it would rank the resulting hits in a way that made sense to the medical establishment, not to advertisers or average Web surfers. The crawler would get progressively better at building its nightly index by observing the behavior of the searches against it.
Other focused crawlers look for pages containing information that meets specific criteria. But Jenssen's crawler can discern hidden, or indirect, links through a process she likens to the children's search game "warmer-colder."
For example, she says, imagine a Web crawler that focuses on computer science topics. Computer science research papers often are linked to the home pages of the professors who wrote them, and their pages are linked to the professors' universities' home pages. "When the crawler gets to the university page, it searches more intently than it would at a company page," Jenssen says. "It says, 'I'm getting warmer.' It analyzes user behavior and Web paths to automatically learn these trajectories."
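The "warmer-colder" idea resembles best-first crawling, in which links found on promising pages are followed before links found on unpromising ones. The sketch below is a simplification under stated assumptions: warmth is approximated by the share of topic words on the page, whereas Jenssen's crawler is described as learning trajectories from user behavior and Web paths. The in-memory `web` dictionary and all function names are hypothetical.

```python
import heapq


def topic_score(text: str, topic_terms: set[str]) -> float:
    """Fraction of words on the page that are topic terms (assumed warmth)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def focused_crawl(web, seeds, topic_terms, budget):
    """Visit up to `budget` pages, following links from warm pages first.

    web: URL -> (page text, list of outgoing links), a toy stand-in
         for fetching real pages.
    """
    # heapq is a min-heap, so priorities are negated to pop the warmest first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = web[url]
        visited.append(url)
        warmth = topic_score(text, topic_terms)
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                # A link inherits the warmth of the page it was found on.
                heapq.heappush(frontier, (-warmth, link))
    return visited
```

With a small budget, a crawl seeded at both a company page and a university page spends its remaining visits on the professor and paper pages behind the university, and never reaches the company's other pages.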
Filippo Menczer, a computer science professor at Indiana University in Bloomington, says conventional search engines determine a document's relevance by considering various things in isolation. They may first select a document because it contains the keywords in the query. Then, to rank the results, they may consider how many links point to the document. Better results could be obtained from considering many such "measures of relevance," including user preferences, in combination, and from considering combinations of pages rather than single pages, Menczer says.
Such complex and powerful searches will be practical in three to five years when computers are more powerful. "We'll do brute-force, large-scale data mining over the whole Web, over many terabytes of information," says Menczer.
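One simple way to consider several measures together is a weighted sum, sketched below. This is only an illustration: the article does not describe how Menczer would combine the measures, and the `Doc` fields, the weights and the function name are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    url: str
    keyword_match: float  # 0..1, how well the text matches the query
    inbound_links: int    # number of links pointing to the document
    preference: float     # 0..1, fit with the user's known interests


def combined_rank(docs, weights=(0.5, 0.3, 0.2)):
    """Sort documents by a weighted sum of three relevance measures.

    The weights are arbitrary illustrative values, not tuned ones.
    """
    # Scale link counts to 0..1 so no single measure dominates by magnitude.
    max_links = max((d.inbound_links for d in docs), default=0) or 1
    w_kw, w_link, w_pref = weights

    def score(d):
        return (w_kw * d.keyword_match
                + w_link * d.inbound_links / max_links
                + w_pref * d.preference)

    return sorted(docs, key=score, reverse=True)
```

Under these weights, a heavily linked page that matches the query and the user's interests poorly can rank below a less popular page that matches both well. Ranking by link count alone (weights of 0, 1, 0) reverses that order.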
Data Fountain
Brute force is a pretty good description of IBM's WebFountain, a huge Linux cluster that runs 9,000 programs continuously and crawls 50 million new pages every day. But WebFountain doesn't simply index keywords; it applies natural-language analysis concepts to extract meaning from unstructured text.
For example, it determines whether an entity is a person's name, company name, location, product, price and so on, and then it attaches searchable XML metadata tags to it. "We are tagging the entire Web, all of Usenet news, all the wire services and so on," says Dan Gruhl, WebFountain's chief architect at IBM's Almaden Research Center.
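To make the tagging idea concrete, here is a toy sketch that wraps recognized entities in XML-style tags. It assumes a small hand-written lookup table and a price pattern, which is far simpler than the natural-language analysis attributed to WebFountain. The `entity` tag name, its `type` attribute and the lookup entries are hypothetical, not IBM's schema.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical lookup table; a real system would not rely on a fixed list.
GAZETTEER = {"IBM": "company", "Ford": "company", "Almaden": "location"}
PRICE = r"\$\d+(?:,\d{3})*(?:\.\d{2})?"


def tag_entities(text: str) -> str:
    """Wrap known names and dollar amounts in <entity type="..."> tags."""
    names = "|".join(re.escape(n) for n in sorted(GAZETTEER, key=len, reverse=True))
    pattern = re.compile(rf"(?P<price>{PRICE})|\b(?P<name>{names})\b")
    out, pos = [], 0
    for m in pattern.finditer(text):
        out.append(escape(text[pos:m.start()]))  # untagged text before the match
        kind = "price" if m.group("price") else GAZETTEER[m.group("name")]
        out.append(f'<entity type="{kind}">{escape(m.group(0))}</entity>')
        pos = m.end()
    out.append(escape(text[pos:]))  # trailing untagged text
    return "".join(out)
```

Once text carries tags like these, a search can ask for a word as a company name or a price instead of matching the bare keyword.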
The software is pretty good at extracting and tagging the semantic meaning of unstructured text, but Gruhl says more research is needed to do reliable "sentiment analysis," which, for example, would let companies automatically monitor the reputations of their products. (To read more about this feature, see "Winning the Name Game" at QuickLink 45643.)
Researchers at the Almaden center are experimenting with Sentiment Analyzer, which tries to extract opinions from online text documents. If a customer said at a Web site, "The Ford Explorer is great," that would be easy to classify, Gruhl says, but if the customer said sarcastically, "It's almost as good as the Ford Pinto," semantic analysis software would be stumped.
Making sense of that kind of statement is one of the goals of IBM's research.