I glibly wrote a blog post Sunday saying how confident I was that the data was pointing to a Hillary Clinton win. Pretty much all the polls were showing that; almost all serious political data analysts and models said the same.
What went wrong?
I'll leave political reasons to others. From a data-science perspective, though, I'm interested in learning something from this data debacle. The key issue I see is not fully understanding uncertainty when it comes to forecasting elections - a good lesson for anyone who tries to make predictions from data.
Things are pretty straightforward if you just want to measure public opinion: You get a random sample of the population, you ensure as best you can that it's a representative sample, you ask unbiased questions and you do your statistical calculations.
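Those "statistical calculations" are mostly textbook stuff. As a simplified sketch (the normal-approximation formula for a simple random sample, not any pollster's actual methodology, and with made-up numbers):

```python
import math

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a sample proportion p
    from a simple random sample of size n (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# A hypothetical poll of 1,000 respondents showing a candidate at 48%:
moe = margin_of_error(0.48, 1000)
print(f"+/- {moe * 100:.1f} points")  # roughly +/- 3.1 points
```

Note that this only quantifies random sampling error. Everything that follows in this post is about the errors this formula doesn't capture.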
But there's an extra piece when it comes to election forecasting: Measuring overall "public opinion" isn't the same as measuring public opinion among people who are actually going to vote.
So there are two complexities here, not just one. You not only need to make sure your sample properly reflects the public; you also need to make sure your model of "the public" accurately predicts who will turn out and vote. Some people who say they're going to vote may not end up casting a ballot.
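To see how much the turnout model matters, here's an illustrative toy calculation (the group definitions, support levels and turnout probabilities are all hypothetical, chosen only to show the mechanism): the exact same sample of respondents produces different "poll results" depending on the turnout assumptions you apply.

```python
# Hypothetical numbers: each group is (share of sample,
# support for candidate A, assumed turnout probability).
groups = {
    "high-propensity": (0.60, 0.46, 0.90),
    "low-propensity":  (0.40, 0.55, 0.50),
}

def weighted_support(groups, use_turnout: bool) -> float:
    """Candidate A's support, optionally weighting each group
    by how likely its members are to actually vote."""
    num = den = 0.0
    for share, support, turnout in groups.values():
        w = share * (turnout if use_turnout else 1.0)
        num += w * support
        den += w
    return num / den

print(f"All respondents:  {weighted_support(groups, False):.1%}")  # 49.6%
print(f"Turnout-weighted: {weighted_support(groups, True):.1%}")   # 48.4%
```

With these made-up numbers, the turnout model alone moves the candidate more than a point. Get the turnout probabilities wrong and a perfectly representative sample still yields a wrong forecast.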
Another factor too often ignored: Here in the U.S., access to the ballot is not equal. One study found that minority voters are six times as likely as white voters to have to wait more than an hour to vote. Some people might very well have intended to vote but weren't able (or willing) to wait that long. That guarantees a mismatch between polled likely voters and actual results. How much of an effect this had this year is hard to say. It likely wasn't zero, but I'm not sure you can say that's the full story.
When Brexit polls didn't predict final results, I chalked it up to the difficulty of modeling voter turnout in a one-off referendum (and thus no comparable voter history to help). And in fact, given the complexities of modeling, the polls weren't that far off (a few were within margin of error).
Something seems to be going on.
Nate Silver was harshly criticized for saying he thought there were three equally likely election outcomes: Clinton solid win, Clinton blowout or Trump close win. To many, that sounded like "I'm covering all my bases whatever happens," when other forecasters were sticking their necks out and issuing more solid predictions. But what he was telling people who yearned for a hard number was that there was a lot more uncertainty in this data than many realized.
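Why does that uncertainty produce such divergent scenarios? One big reason is that polling errors are correlated: if the polls are off nationally, they tend to be off in the same direction everywhere. A toy Monte Carlo (emphatically not any forecaster's actual model; all numbers are invented) shows how a shared error makes extreme outcomes far more likely than independent state-by-state noise would suggest:

```python
import random

random.seed(42)

# Hypothetical polled leads for candidate A in five swing "states", in points.
polled_margins = [3.0, 1.5, 0.5, -0.5, 2.0]

def simulate(trials: int = 20_000) -> list[float]:
    """Fraction of trials in which candidate A wins exactly k states.
    Each trial draws one *shared* national polling error plus small
    independent state-level noise."""
    counts = [0] * (len(polled_margins) + 1)
    for _ in range(trials):
        shared = random.gauss(0, 2.0)  # systematic error hits every state
        wins = sum(
            1 for m in polled_margins
            if m + shared + random.gauss(0, 1.0) > 0
        )
        counts[wins] += 1
    return [c / trials for c in counts]

for k, frac in enumerate(simulate()):
    print(f"A wins {k} states: {frac:.1%}")
```

Run it and the distribution is lumpy at the ends: sweeps in either direction get real probability mass, because one shared miss tips every state at once. That's the shape of forecast Silver was describing, rather than a single confident number.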
Was it just a strange election cycle where voter preference was malleable? Or is this systemic?
Mobile-phone-only households, a known polling challenge, add to the complexity of proper sampling, but they're far from the only issue. I think it's becoming more difficult to get accurate samples in part because people who won't talk to pollsters aren't the same as people who will.
We know that some people simply don't want to take polls, whether for political reasons, privacy concerns or simple dislike of being bothered. That's fine as long as folks who opt out are roughly the same as the rest. But are they? Not wanting to be interrupted during dinnertime is probably a near-universal trait. But a Fox-News-watching, mainstream-media-hating Trump supporter may be far less likely to chat with a newspaper pollster than an MSNBC-watching Democrat, even if both are 35-year-old college-educated suburban white women. Correcting for race, gender, ethnicity and age isn't going to solve this problem.
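A tiny worked example makes the point concrete (all numbers hypothetical): take a single demographic cell, so demographic weighting has nothing left to correct, and let response rates differ by candidate preference.

```python
# One demographic cell (say, 35-year-old college-educated suburban women).
# Made-up numbers: supporters of A answer pollsters half as often as B's.
true_support_A = 0.50
response_rate = {"A": 0.05, "B": 0.10}

# Expected composition of this cell's respondents:
responders_A = true_support_A * response_rate["A"]
responders_B = (1 - true_support_A) * response_rate["B"]
polled_support_A = responders_A / (responders_A + responders_B)

print(f"True support for A:   {true_support_A:.0%}")    # 50%
print(f"Polled support for A: {polled_support_A:.0%}")  # 33%
# Reweighting by age, race, gender or education leaves this gap
# untouched: everyone in the cell already has identical demographics.
```

The poll is off by 17 points inside a perfectly weighted cell. Differential nonresponse has to be corrected on some variable that actually tracks the propensity to respond, and the usual demographic knobs don't.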
One high-profile polling error may just be a fluke, but a string of them makes me considerably less confident in using this data for election forecasts. I'm not arguing "garbage in, garbage out" -- it's not the data's fault if I don't properly understand what it can and cannot tell me. It's more that "this data can't necessarily predict what I want it to, unless I have very good reason to believe polls' turnout models are accurate."