Big data analytic systems are reputed to be capable of finding a needle in a universe of haystacks without having to know what a needle looks like.
Even the simplest part of that process – sorting all the data available into Haystacks and Not Haystacks so the analytics can at least work with data that is relevant – requires a topical analysis that uses the metadata accompanying each giant pile of data to classify each bit according to topic as well as source, format and other criteria.
The very best ways to sort large databases of unstructured text is to use a technique called Latent Dirichlet allocation (LDA) – a modeling technique that identifies text within documents as belonging to a limited number of still-unknown topics, groups them according to how likely it is that they refer to the same topic, then backtracks to identify what those topics actually are. (Here's the full explanation in the Journal of Machine Learning Research; here's Wikipedia's. )
LDA is "the state of the art in topic modeling, according to analysis published Thursday in the American Physical Society's journal Physical Review X, which said that, in the 10 years since its introduction, LDA had become one of the most common ways to accomplish the computationally difficult problem of classifying specific parts of human language automatically into a context-appropriate category.
Unfortunately, LDA is also inaccurate enough at some tasks that the results of any topic model created with it are essentially meaningless, according to Luis Amaral, a physicist whose specialty is the mathematical analysis of complex systems and networks in the real world and one of the senior researchers on the multidisciplinary team from Northwestern University that wrote the paper.
The team tested LDA-based analysis with repeated analyses of the same set of unstructured data – 23,000 scientific papers and 1.2 million Wikipedia articles written in several different languages.
Even worse than being inaccurate, the LDA analyses were inconsistent, returning the same results only 80 percent of the time even when using the same data and the same analytic configuration.
Accuracy of 90 percent with 80 percent consistency sounds good, but the scores are "actually very poor, since they are for an exceedingly easy case," Amaral said in an announcement from Northwestern about the study.
Applied to messy, inconsistently scrubbed data from many sources in many formats – the base of data for which big data is often praised for its ability to manage – the results would be far less accurate and far less reproducible, according to the paper.
"Our systematic analysis clearly demonstrates that current implementations of LDA have low validity," the paper reports (full text PDF here).
The team created an alternative method called TopicMapping, which first breaks words down into bases (treating "stars" and "star" as the same word), then eliminates conjunctions, pronouns and other "stop words" that modify the meaning but not the topic, using a standardized list.
Then the algorithm builds a model identifying words that often appear together in the same document and use the proprietary Infomap natural-language processing software to assign those clusters of words into groups identified as a "community" that define the topic. Words could appear in more than one topic area.
The new approach delivered results that were 92 percent accurate and 98 percent reproducible, though, according to the paper, it only moderately improved the likelihood that any given result would be accurate.
The real point was not to replace LDA with TopicMapping, but to demonstrate that the topic-analysis method that has become one of the most commonly used in big data analysis is far less accurate and far less consistent than previously believed.
The best way to improve those analyses, according to Amaral, is to apply techniques common in community detection algorithms – which identify connections among specific variables and use those to help categorize or verify the classification of those that aren't clearly in one group or another.
Without that kind of improvement – and real-world testing of the results of big data analyses – companies using LDA-based text analysis could be making decisions based on results whose accuracy they can't know for sure.
"Companies that make products must show that their products work," Amaral said in the Northwestern release. "They must be certified. There is no such case for algorithms. We have a lot of uninformed consumers of big data algorithms that are using tools that haven't been tested for reproducibility and accuracy."