Harvard takes down a Factiva-powered text-mining operation

The Crimson, the student-run newspaper at Harvard, has a report of an unusual incident in a campus library. Administrators at the Harvard Business School library were forced to block a user's IP address from accessing Factiva, an online database of news articles and other text documents, after determining that the user had downloaded millions of articles in the span of a few months. From The Crimson:

The mystery user downloaded an average of 55,000 documents per day, according to Lydia Petersen, a content manager for HBS’s Baker Library. The user retrieved the documents at a rate as high as four per second, which led Factiva and library officials to believe that an automated script controlled the downloads. The use of such a script is prohibited by Factiva.

Why would someone want so many documents? One possible use: "Scraper" blogs. These are bogus blogs -- frequently hosted on Google's Blogger/blogspot service -- whose owners copy articles from news sites and other sources, and then illegally repost them in order to attract traffic and AdSense revenue from Google. Some scrapers also redirect users to illicit sites hawking things like porn and pills.

But the Harvard Business School suspects that the user in question was interested in something else entirely: Text mining. From the article:

Petersen said that the user might have been downloading articles for "text-mining," a research method that uses complex natural language processing to extract information from, and identify patterns in, large aggregations of text.

"Text-mining is increasingly becoming a legitimate research method," Petersen said. "Vendors and academicians are going to have to come to some understanding about this use. Academicians need these texts to do this kind of research, but at the same time vendors need to protect their intellectual property."

I sympathize with the unidentified user at the Harvard Business School. Computer-assisted text analysis is an extremely valuable tool for identifying trends in news coverage, political texts, and other documents. Unfortunately, the tools available to conduct text analysis are still quite limited. I've used text analysis to examine BusinessWeek's coverage of Second Life, and also completed a detailed content analysis of Chinese state-run media coverage using LexisNexis Academic, a service similar to Factiva. For brief searches that are intended to pull just a few results, the Web interfaces for the two databases are fine. But both services hobble users who attempt to download too many articles, as I discovered in my research relating to China's Xinhua news agency (referred to as the New China News Agency, or NCNA, here):

The more I use LexisNexis, the more I am aware of the limitations of the interface and the results that are displayed. When trying to gather monthly totals of NCNA English items, the error message that results when more than 1000 hits are returned causes lots of problems for me. By the early 90s, each month typically resulted in more than 5000 NCNA news items. Practically speaking, it meant more than 100 searches per year, compared to fewer than 40 per year in the early 80s. If I could perform SQL queries on the LexisNexis database, instead of using the crappy [LexisNexis Academic] Web form, I could have had the same results in less than an hour.
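
To make that last point concrete, here is a minimal sketch of the kind of query I had in mind. Everything in it is an assumption for illustration: LexisNexis does not expose a SQL interface, so the `articles` table, its columns, and the handful of placeholder rows are all hypothetical, and an in-memory SQLite database stands in for the real archive. The point is only that a single grouped count would return every monthly total at once, with no 1000-hit ceiling to work around.

```python
import sqlite3

# Hypothetical schema: LexisNexis offers no such table. The table and
# column names are invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (source TEXT, language TEXT, pub_date TEXT)")

# Placeholder rows standing in for archive records (not real data).
sample_rows = [
    ("NCNA", "English", "1982-01-05"),
    ("NCNA", "English", "1982-01-19"),
    ("NCNA", "English", "1982-02-03"),
    ("NCNA", "Chinese", "1982-02-10"),
    ("Other", "English", "1982-02-11"),
]
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)", sample_rows)


def monthly_totals(connection, source, language):
    """Return (month, count) pairs for one source and language."""
    query = """
        SELECT strftime('%Y-%m', pub_date) AS month, COUNT(*) AS items
        FROM articles
        WHERE source = ? AND language = ?
        GROUP BY month
        ORDER BY month
    """
    return connection.execute(query, (source, language)).fetchall()


# One query replaces a separate Web-form search for every month.
totals = monthly_totals(conn, "NCNA", "English")
for month, items in totals:
    print(month, items)
```

Against a real archive, the same grouped count would cover two decades of monthly totals in one pass, which is the work that took me dozens of manual searches per year.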

One solution would help researchers while still addressing the intellectual property concerns of Factiva and LexisNexis: These companies could create (or buy) software tools that let users run text analysis on large pools of articles inside the Web interface, without having to download the articles to their own computers. The demand for such tools is apparent. If Factiva and LexisNexis don't satisfy it, there's a good chance that a well-known upstart that's getting into the business of archiving old news articles and other text content -- namely, Google -- will dominate this new market.

Copyright © 2007 IDG Communications, Inc.
