By Roger M. Stein
Before companies can profit from big data, they often must deal with bad data. There may indeed be gold in the mountains of information that firms collect today, but there also are stores of contaminated or "noisy" data. In large organizations, especially financial institutions, data often suffer from mislabeling, omissions, and other inaccuracies. In firms that have undergone mergers or acquisitions, the problem is usually worse.
Contaminated data are a fact of life in statistics and econometrics. It is tempting to ignore or throw out bad data, or to assume that they can somehow be "fixed" (or even identified). In general, they cannot.
I have been studying and writing about how to clean up and optimize data for big data analysis since the early 1990s. At this point I am forced to concede that figuring out how to clean, transform and recast real-world data to make it informative and actionable is as much an art as a science. It turns out that a big chunk of the time we spend on data analytics goes to cleaning and recoding the data we work with so that our algorithms and queries can give us sensible clues to what might really be going on under the surface. (The other big chunk of time goes to problem formulation, a subject for another posting.)
Data corruption is particularly concerning when noisy data are used to test a predictive model. Data problems here can be especially acute because we are using the data, for example, to determine the degree to which we can trust a new predictive model, or whether we should recalibrate an existing model or take it off-line entirely. If the test is flawed, our conclusions will be flawed too.
So it would seem counterintuitive for me to say that under certain conditions, even bad data can be put to good use for testing models. One of my recent papers demonstrates that if we know a little bit about how and why data are noisy, we can adjust the results of tests and use the results from the noisy data to improve decision-making about predictive models.
One of the big applications of big data at commercial banks involves predicting which borrowers will default and which ones will not. This application is endemic in finance. (In fact, the final project in my credit class involves having students build a bare-bones default model using hundreds of thousands of bank records.)
Until about fifteen years ago, default models were tested anecdotally: Analysts would pick a few dozen defaulted firms and examine whether a proposed model gave "bad" scores to them. However, at the very end of the last century, researchers began to import into finance the techniques used to evaluate medical tests and radar detection.
In medicine, for example, doctors and researchers use these techniques to determine which diagnostic tests are most effective. Although a medical test may be generally reliable, it will sometimes produce false positives and false negatives. By plotting the percentages of these errors on a graph, researchers can determine the most effective ways to interpret test results. Similar techniques generalize nicely to finance, where the diagnostic test is replaced with, say, a default model and "disease status" is replaced with "default status."
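To make the plotting idea concrete, here is a minimal Python sketch of how such a curve (usually called an ROC curve) might be built for a default model. The scores and default flags are made-up illustrative values, and the function names `roc_points` and `auc` are hypothetical, not taken from any particular library. At each score cutoff the sketch records the false positive rate (good borrowers flagged as risky) and the true positive rate (one minus the false negative rate), then summarizes the curve by the area under it.

```python
def roc_points(scores, defaulted):
    """Return (false_positive_rate, true_positive_rate) pairs, one per cutoff.

    Assumes a higher score means a riskier borrower; a borrower is flagged
    when its score is at or above the cutoff.
    """
    n_pos = sum(defaulted)
    n_neg = len(defaulted) - n_pos
    points = [(0.0, 0.0)]  # cutoff above every score: nobody is flagged
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, defaulted) if s >= cutoff and d)
        fp = sum(1 for s, d in zip(scores, defaulted) if s >= cutoff and not d)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoid rule (0.5 = coin flip, 1.0 = perfect)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Illustrative data only: eight borrowers, three of whom defaulted (flag = 1).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
defaulted = [1, 1, 0, 1, 0, 0, 0, 0]

points = roc_points(scores, defaulted)
# 14 of the 15 defaulter/non-defaulter pairs are ranked correctly, so the area is 14/15.
print(round(auc(points), 4))
```

The area under the curve can be read as the chance that a randomly chosen defaulter receives a riskier score than a randomly chosen non-defaulter, which is why it is a convenient single number for comparing models.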
These tests tend to work fine when data are reliable, but they can produce questionable results when data are noisy. Unfortunately, noisy data are fairly common. For example, in the case of default prediction models, historical data on borrower outcomes are sometimes mislabeled—with either good loans classified as defaults or, more often, bad loans classified as non-defaults. When banks convert to new formats or when institutions combine, errors multiply.
In a recent paper I developed an approach that banks can use to test models, even when they know the test data are contaminated. By taking into account the fact that some of the data used to test a model are erroneous and adjusting the results of a test to reflect that, analysts can (in expectation) account for data noise.
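As a rough illustration of what "adjusting the results of a test" can look like, the Python sketch below uses a simplified textbook-style relationship. It is not the method from the paper. It assumes that mislabeling is random with respect to the model's score, that ties in score are negligible, and that the analyst has outside estimates of two noise rates: `alpha`, the share of records labeled "default" that are really non-defaults, and `beta`, the share labeled "non-default" that are really defaults. Under those assumptions the area under the ROC curve measured on noisy labels is, in expectation, pulled toward 0.5 by the factor (1 - alpha - beta), and the pull can be reversed. The function names and the numbers are hypothetical.

```python
def expected_noisy_auc(true_auc, alpha, beta):
    """Expected AUC on noisy labels, assuming flips are unrelated to the score.

    alpha: assumed share of labeled defaults that are really non-defaults
    beta:  assumed share of labeled non-defaults that are really defaults
    """
    return 0.5 + (1.0 - alpha - beta) * (true_auc - 0.5)


def adjusted_auc(noisy_auc, alpha, beta):
    """Invert the relationship above to estimate the clean-data AUC."""
    shrink = 1.0 - alpha - beta
    if shrink <= 0:
        raise ValueError("labels are too noisy for this adjustment")
    return 0.5 + (noisy_auc - 0.5) / shrink


# Illustrative numbers only: a model scores 0.74 on labels believed to have
# alpha = 0.05 and beta = 0.15, so the shrink factor is 0.8.
estimate = adjusted_auc(0.74, alpha=0.05, beta=0.15)
print(round(estimate, 4))  # 0.5 + 0.24 / 0.8 = 0.8
```

The adjustment is only as good as the noise estimates fed into it, and it holds on average rather than for any single test sample.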
Bad Data or Bad Model?
Of course, banks have known for years that significant portions of their data are incomplete, wrongly coded or corrupted. But analysts often assumed that the bad data would affect all models equally and thus "cancel out" when testing two models. My results show that this is not true. One of the interesting consequences of this research is that the accuracy of better performing models deteriorates faster than that of poorer performing models when corrupted data are introduced. As data get noisier, it becomes harder to tell the difference between a poorly performing model and a very good one.
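A small simulation can illustrate why noise does not simply "cancel out." The Python sketch below is an assumption-laden toy, not a reproduction of the paper's analysis: the two models, the 10 percent default rate, and the flip rates (30 percent of true defaults recorded as non-defaults, 2 percent of non-defaults recorded as defaults) are all invented, and the flips are assumed to be unrelated to either model's score. It measures each model's area under the ROC curve on clean and on corrupted labels.

```python
import random


def auc(scores, labels):
    """Rank-based AUC: chance a labeled default outscores a labeled non-default.

    Ties are ignored, which is harmless here because scores are continuous.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


rng = random.Random(0)
n = 50000
truth = [1 if rng.random() < 0.10 else 0 for _ in range(n)]

# Hypothetical models: random noise plus a shift for true defaulters.
# The strong model separates the two groups far better than the weak one.
strong = [rng.gauss(2.0 * d, 1.0) for d in truth]
weak = [rng.gauss(0.5 * d, 1.0) for d in truth]


def corrupt(label):
    """Flip labels at random, more often hiding defaults than inventing them."""
    if label == 1:
        return 0 if rng.random() < 0.30 else 1
    return 1 if rng.random() < 0.02 else 0


noisy = [corrupt(d) for d in truth]

strong_clean, strong_noisy = auc(strong, truth), auc(strong, noisy)
weak_clean, weak_noisy = auc(weak, truth), auc(weak, noisy)

drop_strong = strong_clean - strong_noisy
drop_weak = weak_clean - weak_noisy
gap_clean = strong_clean - weak_clean
gap_noisy = strong_noisy - weak_noisy

print(f"strong model: clean {strong_clean:.3f}, noisy {strong_noisy:.3f}")
print(f"weak model:   clean {weak_clean:.3f}, noisy {weak_noisy:.3f}")
print(f"gap between models: clean {gap_clean:.3f}, noisy {gap_noisy:.3f}")
```

In this setup the strong model should lose more measured accuracy than the weak one, so the gap between them narrows on the corrupted labels. Exact figures depend on the invented parameters and the random seed.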
Although these results can be generalized to many other "discrete choice" settings outside of finance, to be clear, they only provide guidance on one narrow real-world challenge in model testing. Indeed, this has been a rich and active area of research in the statistics and econometrics literature for decades and has led to far broader and more ambitious results than the new ones I wrote about.
However, I still find these new results encouraging. They suggest that even when data are contaminated and mislabeled in realistic but unrecoverable ways, with some knowledge of the extent and structure of data noise, firms can still use these "bad" data to better understand and test their predictive models. It would be convenient if data were always clean and accurate. That is typically not the case, and the results show that this fact cannot simply be ignored. In some settings, though, with proper adjustments and interpretations, bad data can still lead to good decisions.
Roger M. Stein is a Senior Lecturer in Finance at the MIT Sloan School of Management, a Research Affiliate at the MIT Laboratory for Financial Engineering, and Chief Analytics Officer at State Street Global Exchange.