12 predictive analytics screw-ups

Make these mistakes and you won't need an algorithm to predict the outcome

Page 2 of 6

3. Don't proceed until your data is the best it can be.

People often operate under the misconception that they must have their data perfectly organized, without any holes, disorder or missing values, before they can start a predictive analytics project.

[Image. Source: Flickr/ChicagoSage.]

One global petrochemical company, an Elder Research client, had just begun a predictive analytics project with a great potential return on investment when data scientists discovered that the state of the operations data was much worse than they had initially thought.

In this case, a key target value was missing. Had the business waited to gather new data, the project would have been delayed for at least a year. "A lot of companies would have stopped right there. I see this kill more projects than any other mistake," says Deal.

But data scientists are used to dealing with messy and incomplete data, and they have methodologies that, in many cases, allow them to work around the problem. This time, the business moved forward, and eventually the data scientists found a way to derive the missing target values from other data, according to John Ainsworth, data scientist at Elder Research.
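One common workaround of this kind is deriving a missing target from proxy signals elsewhere in the data. The article doesn't describe the petrochemical company's actual method, so the sketch below is purely illustrative: it assumes hypothetical operations columns (`unplanned_downtime_hrs`, `emergency_work_orders`) and fills in a missing failure label from them.

```python
import pandas as pd

# Hypothetical operations data: the "failed" target column is partly missing,
# but downtime and work-order records hint at when a failure actually occurred.
ops = pd.DataFrame({
    "asset_id": [1, 2, 3, 4],
    "failed": [1.0, None, None, 0.0],          # target with missing values
    "unplanned_downtime_hrs": [12.0, 9.5, 0.0, 0.0],
    "emergency_work_orders": [2, 1, 0, 0],
})

# Derive the missing target from proxy signals: an asset with unplanned
# downtime and at least one emergency work order is treated as a failure.
derived = ((ops["unplanned_downtime_hrs"] > 0) &
           (ops["emergency_work_orders"] > 0)).astype(float)
ops["failed"] = ops["failed"].fillna(derived)
print(ops["failed"].tolist())  # [1.0, 1.0, 0.0, 0.0]
```

The derivation rule itself is a modeling decision that should be validated with domain experts, but it shows why an incomplete target column need not kill a project.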

The project is now on track to deliver major cost savings by accurately predicting failures, avoiding costly shutdowns and identifying exactly where to apply expensive preventive maintenance procedures. Had they waited for perfect data, however, it never would have happened, Deal says, "because priorities change and the data never gets fixed."

4. When reviewing data quality, don't bother to take out the garbage.

Eric Siegel, president of the consultancy Prediction Impact and author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, once worked with a Fortune 1000 financial services company that wanted to predict which call-center staff hires would stay on the job longest.

[Image: garbage truck. Source: REUTERS/Melissa Renwick.]

At first blush, the historical data appeared to show that employees without a high-school diploma were 2.6 times more likely to stay on the job for at least nine months than were employees with other educational backgrounds. "We were on the verge of recommending that the client begin to prioritize hiring high-school dropouts," Siegel says.

But there were two problems. First, the data, which had been manually keyed in from job applicant resumes, had been labeled inconsistently. One data entry person checked off all educational levels that applied, while another checked only the highest degree completed.

Compounding the problem, the latter person had happened to key in more of the resumes of the longest-tenured employees than the former had. Those issues could have been avoided by assigning each labeler a random set of resumes to key in and requiring everyone to follow the same labeling methodology.
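The two safeguards above can be sketched in a few lines. This is a generic illustration, not the client's process; the labeler names and resume identifiers are made up, and the "same methodology" part is captured only as a comment, since it's a process rule rather than code.

```python
import random

resumes = [f"resume_{i}" for i in range(10)]
labelers = ["labeler_a", "labeler_b"]  # hypothetical labelers

rng = random.Random(42)   # seeded so the assignment is reproducible
rng.shuffle(resumes)      # randomize order so neither labeler gets a biased slice

# Round-robin over the shuffled list: each labeler gets a random, balanced share.
assignment = {labeler: resumes[i::len(labelers)]
              for i, labeler in enumerate(labelers)}

# Separately, every labeler must apply one rule, e.g. record only the
# highest degree completed, so entries from different people stay comparable.
```

Random assignment breaks any correlation between who keyed in a resume and the outcome being predicted, which is exactly the correlation that nearly misled the model here.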

But the bigger message is this, says Siegel: "Garbage in, garbage out. Be sure to carefully QA your data to ensure its integrity."

5. Use data from the future to predict the future.

The problem with data warehouses is that they're not static: Information is constantly changed and updated. But predictive analytics is an inductive learning process that relies on analysis of historical data, or "training data," to create models. So you need to recreate the state the data was in at the earlier point in the customer lifecycle. If data is not date- and time-stamped, it's easy to include data from the future, which generates misleading results.
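In practice, recreating that earlier state means filtering every record by its timestamp against an "as of" cutoff. The sketch below uses hypothetical event data to show the idea: only records stamped strictly before the moment the prediction would have been made are allowed into the training view.

```python
import pandas as pd

# Hypothetical member event log; the last row happens after the outcome
# being predicted and must not leak into training features.
events = pd.DataFrame({
    "member_id": [7, 7, 7],
    "field":     ["quote_requested", "bought_insurance", "cancelled_via"],
    "value":     ["yes", "yes", "email"],
    "timestamp": pd.to_datetime(["2023-01-05", "2023-02-10", "2023-09-01"]),
})

# The model predicts a decision made on 2023-02-10, so training features
# may only use information recorded strictly before that date.
as_of = pd.Timestamp("2023-02-10")
training_view = events[events["timestamp"] < as_of]
print(training_view["field"].tolist())  # ['quote_requested']
```

Without the timestamps this filter is impossible, which is why date- and time-stamping the warehouse is the prerequisite for honest training data.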

[Image: DeLorean auto. Source: Flickr/Clauz Jardim.]

That's what happened to a regional auto club when it set about the task of building a model it could use to predict which of its members would be most likely to buy its insurance product.

For modeling purposes, the club needed to recreate what the data set was like early on, prior to when members had bought or declined to buy insurance, and exclude subsequent data. The organization had created a decision tree that included a text variable containing phone, fax or email data. When the variable contained any text, there was 100% certainty that those members would later buy the insurance.

"We were assured that the indicator was known at the time" -- before the members had purchased the insurance -- but auto-club staffers "couldn't tell us what it meant," says Elder, who worked on the project. Knowing this was too good to be true, he continued to ask questions until he found someone in the organization who knew the truth: The variable recorded how members had cancelled their insurance -- by phone, fax or email. "You don't cancel insurance before you buy it," Elder says. So when you build a model, you have to lock away any data that wouldn't have been known at prediction time.
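A variable that predicts the target with 100% certainty is the classic signature of leakage, and it can be screened for mechanically. The toy example below mimics the auto-club case with made-up data: the mere presence of a value in the cancellation-method field perfectly separates buyers from non-buyers, which is the red flag worth investigating before trusting the model.

```python
import pandas as pd

# Hypothetical member data: "contact_method" is populated only for members
# who later bought -- because it actually records how they cancelled,
# which is information from the future.
df = pd.DataFrame({
    "bought": [1, 1, 1, 0, 0, 0],
    "contact_method": ["phone", "fax", "email", None, None, None],
    "tenure_years": [2, 5, 1, 3, 4, 2],
})

# Simple leakage screen: if the mere presence of a value in a feature
# separates the classes perfectly, it is "too good to be true."
has_value = df["contact_method"].notna().astype(int)
accuracy = (has_value == df["bought"]).mean()
if accuracy == 1.0:
    print("contact_method perfectly predicts the target -- check for leakage")
```

A screen like this doesn't prove leakage -- a genuinely strong predictor could also score high -- but it tells you exactly which variables to chase down with the questions Elder kept asking.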
