Computerworld

Search Engines -- The Future

Search engines get smarter, more powerful.
 


April 05, 2004 -- Most information junkies would be hard-pressed to name anything that has transformed their professional lives as much as Internet search engines have. The miraculous devices can take your hot topic of the day, scan millions of Web pages and in seconds bring back product announcements, research papers, the names of experts and more—things that would be difficult or impossible to find otherwise.
But as powerful as they are, search engines have huge weaknesses. For example, a recent Google search on the word Linux took just 0.4 seconds, but it returned 95 million hits. Too bad if the one you need is No. 10,000 on the list.
Researchers, however, are poised to revolutionize search technology over the next few years. The most common thrust is to personalize search engines so that they know, for example, that if you're an IT professional and you search for mouse, you're more likely to want information about PC devices than about animals.
Adele Howe, a computer science professor at Colorado State University in Fort Collins, and Gabriel Somlo, a CSU graduate student, have built a proof of concept called QueryTracker, a software agent that sits between a user and a conventional search engine and looks for information of recurring interest, such as the latest news about a user's chronic illness. QueryTracker submits a user's query to the search engine once a day and returns results from new Web pages and pages that have changed since the previous search.
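The "new or changed since the previous search" step can be sketched in a few lines of Python. This is only an illustration, not QueryTracker's actual code: the function names, the use of a SHA-256 fingerprint and the in-memory dictionary of previously seen pages are all assumptions.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Return a short, stable digest of a page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def new_or_changed(results: dict, seen: dict) -> list:
    """Return URLs that are new or whose text changed since the last run.

    results maps URL -> page text from today's search (hypothetical input).
    seen maps URL -> fingerprint from earlier runs and is updated in place.
    """
    fresh = []
    for url, text in results.items():
        digest = fingerprint(text)
        if seen.get(url) != digest:
            fresh.append(url)
        seen[url] = digest
    return fresh
```

Run once a day against the same query, a filter like this would pass along only the pages a user has not already seen in their current form.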
The magic in QueryTracker comes from its automatic generation of an additional daily query—which Howe says is often superior to the user's original query—based on what it learns about the user's interests and priorities over time. It filters the results of both queries for relevance and sends them to the user.
QueryTracker's ability to generate its own searches can compensate for the poorly formed queries that many users write, Howe says. "Even people knowledgeable about the Web are often either lazy or they are just not informed about how to write good queries," she says. The most common mistake: queries that are too short, like the one-word Linux search.
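One simple way to lengthen a too-short query is to add terms that recur in pages the user has already found useful. The Python sketch below assumes that approach; the article does not say how QueryTracker builds its queries, and the stopword list, function name and parameters here are hypothetical.

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "is", "on", "with"}

def expand_query(query: str, liked_docs: list, extra_terms: int = 3) -> str:
    """Append the most frequent non-query terms from pages the user liked."""
    query_terms = query.lower().split()
    counts = Counter()
    for doc in liked_docs:
        for word in re.findall(r"[a-z]+", doc.lower()):
            if word not in STOPWORDS and word not in query_terms:
                counts[word] += 1
    added = [term for term, _ in counts.most_common(extra_terms)]
    return " ".join(query_terms + added)
```

Given a one-word query such as Linux and a handful of pages about kernel patches, the expanded query picks up the terms those pages share.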

Jeannette Jenssen, a mathematics professor at Dalhousie University in Halifax, Nova Scotia, is taking search personalization techniques a step further, to the "crawlers" that index Web content before it can be searched. She says the popular search engines have three drawbacks: They are increasingly charging corporate users for their services, they skew results in favor of advertisers, and they often retrieve huge amounts of irrelevant information. But Jenssen's "focused crawler" indexes only pages related to prespecified topics and then tailors the rankings to the interests of the user.
For example, she says, a medical society might run the crawler nightly to index just pages relating to medicine. And it would rank the resulting hits in a way that made sense to the medical establishment, not to advertisers or average Web surfers. The crawler would get progressively better at building its nightly index by observing the behavior of the searches against it.
Other focused crawlers look for pages containing information that meets specific criteria. But Jenssen's crawler can discern hidden, or indirect, links through a process she likens to the children's search game "warmer-colder."
For example, she says, imagine a Web crawler that focuses on computer science topics. Computer science research papers often are linked to the home pages of the professors who wrote them, and their pages are linked to the professors' universities' home pages. "When the crawler gets to the university page, it searches more intently than it would at a company page," Jenssen says. "It says, 'I'm getting warmer.' It analyzes user behavior and Web paths to automatically learn these trajectories."
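The "warmer-colder" idea can be approximated with a best-first crawl: links found on a promising page are followed before links found on an unpromising one. The Python sketch below is a toy under stated assumptions, not Jenssen's crawler. The link graph is an in-memory dictionary, and the warmth scores are set by hand, whereas her crawler is described as learning them from user behavior and Web paths.

```python
import heapq

def focused_crawl(graph: dict, warmth: dict, seed: str, budget: int) -> list:
    """Visit up to `budget` pages, following links from warmer pages first.

    graph maps URL -> list of linked URLs (stands in for fetching pages).
    warmth maps URL -> score in [0, 1]; both are hypothetical inputs.
    """
    frontier = [(0.0, seed)]          # (negated priority, url); heapq is a min-heap
    queued = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        page_warmth = warmth.get(url, 0.0)
        for link in graph.get(url, []):
            if link not in queued:
                queued.add(link)
                # A link inherits the warmth of the page it was found on.
                heapq.heappush(frontier, (-page_warmth, link))
    return visited
```

With a small budget, a crawl that starts at a general portal reaches the research paper by way of the university and professor pages and never gets around to the links on the company page.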
Filippo Menczer, a computer science professor at Indiana University in Bloomington, says conventional search engines determine a document's relevance by considering various things in isolation. They may first select a document because it contains the keywords in the query. Then, to rank the results, they may consider how many links point to the document. Better results could be obtained from considering many such "measures of relevance," including user preferences, in combination, and from considering combinations of pages rather than single pages, Menczer says.
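A weighted sum is one simple way to combine several measures of relevance into a single ranking. The Python sketch below assumes three measures (keyword match, inbound links and a user's preferred topics) and hand-picked weights. None of these details comes from Menczer's work, and the document fields are hypothetical.

```python
import math

# Hypothetical weights for keyword match, link popularity and user preference.
WEIGHTS = (0.5, 0.3, 0.2)

def combined_score(doc, query_terms, preferred_topics, weights=WEIGHTS):
    """Blend three relevance measures, each scaled to the range 0 to 1."""
    words = set(doc["text"].lower().split())
    keyword = sum(term in words for term in query_terms) / len(query_terms)
    # Log-scale the link count so very popular pages do not swamp the score.
    links = min(1.0, math.log1p(doc["inlinks"]) / math.log1p(1000))
    preference = 1.0 if doc["topic"] in preferred_topics else 0.0
    w_kw, w_link, w_pref = weights
    return w_kw * keyword + w_link * links + w_pref * preference

def rank(docs, query, preferred_topics):
    """Sort documents from most to least relevant for this user."""
    terms = query.lower().split()
    return sorted(
        docs,
        key=lambda d: combined_score(d, terms, preferred_topics),
        reverse=True,
    )
```

For the query mouse, a heavily linked page about field mice outranks a page about PC mice until the user's interest in hardware is factored in, at which point the order flips.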

Such complex and powerful searches will become practical in three to five years, as computers grow more powerful, Menczer predicts. "We'll do brute-force, large-scale data mining over the whole Web, over many terabytes of information," he says.
Data Fountain
Brute force is a pretty good description of IBM's WebFountain, a huge Linux cluster that runs 9,000 programs continuously and crawls 50 million new pages every day. But WebFountain doesn't simply index keywords; it applies natural-language analysis concepts to extract meaning from unstructured text.
For example, it determines whether an entity is a person's name, company name, location, product, price and so on, and then it attaches searchable XML metadata tags to it. "We are tagging the entire Web, all of Usenet news, all the wire services and so on," says Dan Gruhl, WebFountain's chief architect at IBM's Almaden Research Center.
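In miniature, attaching XML metadata tags to entities might look like the Python sketch below. It is a toy under stated assumptions: a hand-built lookup table and a price pattern stand in for WebFountain's natural-language analysis, and the element and attribute names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical lookup table; a real system would use trained language models.
GAZETTEER = {"IBM": "company", "Almaden": "location"}
TOKEN = re.compile(r"\$\d+(?:\.\d\d)?|\w+")

def tag_entities(text: str) -> str:
    """Return an XML document listing the entities found in `text`."""
    doc = ET.Element("doc")
    ET.SubElement(doc, "text").text = text
    for match in TOKEN.finditer(text):
        token = match.group()
        if token.startswith("$"):
            kind = "price"
        else:
            kind = GAZETTEER.get(token)
        if kind:
            entity = ET.SubElement(doc, "entity", type=kind, start=str(match.start()))
            entity.text = token
    return ET.tostring(doc, encoding="unicode")
```

Once the tags exist, a search can ask for every document that mentions a company or a price, rather than matching keywords alone.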
The software is pretty good at extracting and tagging the semantic meaning of unstructured text, but Gruhl says more research is needed to do reliable "sentiment analysis," which, for example, would let companies automatically monitor the reputations of their products. (To read more about this feature, see "Winning the Name Game" at QuickLink 45643.)
Researchers at the Almaden center are experimenting with Sentiment Analyzer, which tries to extract opinions from online text documents. If a customer wrote on a Web site, "The Ford Explorer is great," that would be easy to classify, Gruhl says, but if the customer wrote sarcastically, "It's almost as good as the Ford Pinto," semantic analysis software would be stumped.
Making sense of that kind of statement is one of the goals of IBM's research.
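A few lines of Python show why sarcasm is so hard for software that simply counts words. The sketch assumes a naive classifier with a tiny, made-up word list; it is not IBM's Sentiment Analyzer.

```python
import re

# Tiny, illustrative word lists; real lexicons hold thousands of entries.
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"bad", "awful", "terrible"}

def naive_sentiment(sentence: str) -> str:
    """Classify a sentence by counting positive and negative words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

Both of Gruhl's examples come back as positive, because the classifier sees the word good in the Pinto remark and has no way to recognize the comparison as an insult.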



