Keyword Searches Disappoint

Lawyers are using old search technologies that don't find all of the relevant documents.

"Document dump." "Unsearchable morass." That's how Ontario Superior Court Justice Cary Boswell described the nearly 23 million pages of electronic records handed over by the prosecution in an ongoing criminal fraud case against three former Nortel Networks executives.

During a hearing last December, defense lawyers argued that the sheer amount of material provided on a hard drive -- the equivalent of 8,000 to 10,000 boxes of paper -- was so "staggering" and disorganized that it couldn't be effectively searched for information that might help the defendants.

In a ruling afterward, the judge agreed, ordering the prosecution to "re-disclose" any relevant material in a more organized fashion.

That case offers an example of the challenges legal professionals face with e-discovery. And those difficulties are compounded by the fact that typical computer searches don't find all of the relevant information in a data dump. For example, tests by the Text Retrieval Conference (TREC), an international workshop that assesses various information retrieval approaches, show that Boolean keyword searches locate only 22% to 57% of the total number of relevant documents.
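
To make that percentage concrete: the share of relevant documents a search manages to find is usually called recall. The sketch below is only an illustration of the arithmetic, not TREC's evaluation code; the function name and the document IDs are invented for the example.

```python
def recall(retrieved, relevant):
    """Fraction of the relevant documents that the search actually returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical document IDs, invented for illustration.
relevant_ids = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}   # what a reviewer would want to see
retrieved_ids = {2, 4, 6, 8, 11, 12}             # what the search returned

score = recall(retrieved_ids, relevant_ids)
print(f"recall = {score:.0%}")  # prints "recall = 40%"
```

In this made-up case the search found four of the ten relevant documents, which would put it inside the 22% to 57% range cited above.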

"If you're conducting searches electronically, you're never going to be able to say that you turned over every stone," warns Susan Wortzman, a partner at Wortzman Nickle PC, a Toronto law firm specializing in e-discovery.

One problem with today's search tools is the prevalence of false positives. Keyword searches retrieve all the documents containing a specified term, regardless of context. The result is a collection of files that are often irrelevant to a legal team. For example, a search for the word record could turn up documents related to a Beatles album, a Guinness world record or a recorded message.
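
A toy version of that behavior is easy to reproduce. The following sketch assumes a tiny in-memory collection; the document IDs, the sample sentences and the `keyword_search` helper are all hypothetical, and real e-discovery tools work against indexed collections rather than a Python dictionary.

```python
import re


def keyword_search(documents, term):
    """Return IDs of documents that contain `term` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [doc_id for doc_id, text in documents.items() if pattern.search(text)]


# Hypothetical sample collection, invented for illustration.
documents = {
    "memo-1": "The accounting record for the third quarter was altered.",
    "email-2": "Have you heard the new Beatles record?",
    "email-3": "He set a Guinness world record for juggling.",
    "note-4": "Lunch is at noon on Friday.",
}

hits = keyword_search(documents, "record")
print(hits)  # ['memo-1', 'email-2', 'email-3']
```

Only the first hit has anything to do with accounting. The search has no way to tell the other two apart from it, because it matches the word and ignores the context.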

But that's not all. Documents containing inadvertent misspellings can easily fall through the cracks of a standard keyword search. Then there's the inherent ambiguity of language, the combination of text and images, and the introduction of errors by optical character recognition software -- all factors that can significantly impair the e-discovery process.
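
The misspelling problem in particular can be shown in a few lines. This is a minimal sketch using Python's standard `difflib` module for approximate matching; the sample e-mails, the helper names and the 0.8 similarity cutoff are assumptions chosen for the example, not settings from any product mentioned here.

```python
import difflib
import re


def words_in(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def exact_hits(documents, term):
    """IDs of documents containing `term` spelled exactly."""
    return [doc_id for doc_id, text in documents.items() if term in words_in(text)]


def fuzzy_hits(documents, term, cutoff=0.8):
    """IDs of documents containing a word that closely resembles `term`."""
    hits = []
    for doc_id, text in documents.items():
        if difflib.get_close_matches(term, words_in(text), n=1, cutoff=cutoff):
            hits.append(doc_id)
    return hits


# Hypothetical sample collection; "fraudulant" is a deliberate misspelling.
documents = {
    "email-1": "These entries look fraudulent to me.",
    "email-2": "The fraudulant invoices were never reviewed.",
    "email-3": "Please book the conference room.",
}

print(exact_hits(documents, "fraudulent"))  # ['email-1']
print(fuzzy_hits(documents, "fraudulent"))  # ['email-1', 'email-2']
```

The exact search misses the second e-mail entirely. The approximate search recovers it, though a looser cutoff would also start pulling in unrelated words, which is the false-positive problem again.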

"Keyword searching is a blunt instrument," laments Patrick Oot, co-founder of the nonprofit Electronic Discovery Institute and former director of electronic discovery at Verizon Communications Inc.

The Human Factor

One way to improve document searches is to put a human expert -- who knows the topic and terminology -- in the loop, says Gordon Cormack, coordinator of TREC's legal track and a professor of computer science at the University of Waterloo.

"You can do a lot better job of searching for relevant documents if you use a combination of an expert who knows the data set working with individuals who are actually running automated search queries," agrees Jason Baron, director of litigation at the U.S. National Archives and Records Administration and a founding coordinator of the TREC legal track.

Meanwhile, vendors are scrambling to provide alternative search technologies to overcome the limitations of today's tools. E-discovery software vendor Clearwell Systems Inc., for example, has developed what it calls "transparent search," which lets users select specific variations of keywords to reduce the likelihood of false positives.

With Clearwell's tool, a user could conduct a keyword search for, say, the word false that catches derivations such as falsify and falsifying while excluding irrelevant terms, such as falsetto, that a standard keyword search might include.
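
The general idea of choosing among keyword variations can be sketched as follows. This is not Clearwell's implementation, which the article does not describe in detail; it is a simplified illustration in which the helper names, the sample documents and the prefix-based expansion are all assumptions.

```python
import re


def words_in(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def find_variants(documents, stem):
    """List every distinct word in the collection that starts with `stem`."""
    variants = set()
    for text in documents.values():
        variants.update(w for w in words_in(text) if w.startswith(stem))
    return sorted(variants)


def search_selected(documents, selected):
    """IDs of documents containing at least one of the selected variants."""
    selected = set(selected)
    return [doc_id for doc_id, text in documents.items()
            if selected & set(words_in(text))]


# Hypothetical sample collection, invented for illustration.
documents = {
    "email-1": "They agreed to falsify the quarterly numbers.",
    "email-2": "He sang the chorus in falsetto.",
    "email-3": "Falsifying those records was a false economy.",
}

variants = find_variants(documents, "fals")
print(variants)  # ['false', 'falsetto', 'falsify', 'falsifying']

# The reviewer sees the list and deselects the irrelevant variant.
chosen = [v for v in variants if v != "falsetto"]
print(search_selected(documents, chosen))  # ['email-1', 'email-3']
```

The point of the design is that the user sees which variants exist before the search runs, instead of discovering the falsetto e-mail in the results afterward.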

Another alternative is "concept search" technology that retrieves information related to a concept rather than a keyword or phrase. For example, a concept search based on the word oil would recognize that documents about petroleum are also relevant.
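
In its simplest possible form, the oil example could look like the sketch below. It assumes a hand-written table of related terms, which is a deliberate oversimplification: the article does not say how commercial concept-search engines decide which terms are related, and the table, helper name and sample documents here are hypothetical.

```python
import re

# Hypothetical hand-built concept table; only the oil/petroleum pairing
# comes from the example in the text.
CONCEPTS = {
    "oil": {"oil", "petroleum"},
}


def concept_search(documents, term):
    """IDs of documents mentioning `term` or any term listed under the same concept."""
    related = CONCEPTS.get(term, {term})
    return [doc_id for doc_id, text in documents.items()
            if related & set(re.findall(r"[a-z]+", text.lower()))]


# Hypothetical sample collection, invented for illustration.
documents = {
    "doc-1": "Petroleum shipments rose in March.",
    "doc-2": "The oil contract expires next year.",
    "doc-3": "The board met on Tuesday.",
}

print(concept_search(documents, "oil"))  # ['doc-1', 'doc-2']
```

A plain keyword search for oil would return only the second document; the first never uses the word.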

"Conceptual technology is a much more effective tool than keyword searching," says Oleh Hrycko, president of H&A eDiscovery in Toronto. H&A offers a search engine called eExamine Conceptual that groups together documents that address related concepts. The company says the tool cuts search time by up to 70%.

Other search technologies rely on taxonomies of industry terms, or mathematical techniques (such as clustering and latent semantic indexing) that determine the probability that a document has a particular term or concept.
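
Clustering, at least, can be illustrated without much machinery. The sketch below groups documents whose word counts point in a similar direction, using cosine similarity and a fixed threshold. It is a bare-bones stand-in chosen for brevity, not latent semantic indexing and not any vendor's method; the 0.5 threshold, the function names and the sample documents are assumptions.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Count how often each lowercase word appears in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(documents, threshold=0.5):
    """Greedy grouping: join the first cluster whose first member is similar enough."""
    vectors = {doc_id: vectorize(text) for doc_id, text in documents.items()}
    clusters = []
    for doc_id in documents:
        for group in clusters:
            if cosine(vectors[doc_id], vectors[group[0]]) >= threshold:
                group.append(doc_id)
                break
        else:
            clusters.append([doc_id])  # nothing similar yet, so start a new cluster
    return clusters


# Hypothetical sample collection, invented for illustration.
documents = {
    "a": "oil pipeline contract renewal",
    "b": "pipeline contract renewal terms",
    "c": "holiday party catering menu",
    "d": "catering menu for holiday party",
}

print(cluster(documents))  # [['a', 'b'], ['c', 'd']]
```

Nobody told the program which documents were about contracts and which were about the office party. The grouping falls out of the shared vocabulary, which is why such techniques can help reviewers set aside whole piles of irrelevant material at once.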

Ultimately, the best approach might be a combination of the new search technologies. "A startling statistic one of our studies revealed is that 25% of relevant documents were found by Boolean search, while 75% were found by using other methods combined," says Baron.

Despite the increasing availability of more advanced search technologies, the new tools aren't being snapped up by old-school law firms. "In terms of comfort, the more senior practitioners might pine for the good old days when they had boxes of documents [to search through]," says Richard Braman, executive director of The Sedona Conference, an e-discovery think tank in Arizona.

For now, the outcomes of court cases all too often hinge on old-fashioned keyword searches. Warns Cormack: "We have vast needles in haystacks, and we're not using state-of-the-art search techniques to find them."

Waxer is a freelance writer in Toronto. You can contact her at

Copyright © 2010 IDG Communications, Inc.
