The lifeblood of the modern enterprise is information. This isn't news. But as organizations collect more and more information from different sources and applications, it's increasingly difficult to deal with that information. We know how to work with databases, data marts and data warehouses, because information in those places is carefully structured and massaged. (Read the Data Warehousing QuickStudy.) But businesses also need to work with a wealth of unstructured information from sources such as document libraries, spreadsheets, e-mail and instant messaging archives, electronic forms and records, publicly available Web pages and commercial information services.
Two elements are key to this discussion. The first is the unstructured nature of the content: Organizations have to handle streams of what might seem to be random text instead of the carefully delineated and validated fields that we're used to in "normally" managed data.
The second consideration is that companies are getting this information from multiple sources, both inside and outside the enterprise. Each data source has its own organization and format, and most were designed for a single, stand-alone purpose, not to be part of an integrated data collection. Thus, these repositories tend to be silos: They are independent of one another and don't easily work together.
We rely on a growing number of these data sources, and we need to be able to use new ones as they appear without having to rewrite our applications and tools.
The simple-minded answer to this problem is to aggregate all the data into a single, universal database or data warehouse. Unfortunately, creating such a central repository is a slow and expensive process. Maintaining and updating that repository is a job that could give any IT manager nightmares. And we haven't even addressed the issues of scalability and who owns the information. Clearly, a better, more efficient strategy is called for.