I often learn as much by talking to people in between sessions at a conference as I do when listening to the presentations, and this week's Predictive Analytics World event has been no exception. I had the opportunity to see and chat with some very smart people Monday, and here, in no particular order, are six interesting takeaways from those conversations. The conference continues Tuesday at the Seaport World Trade Center in Boston.
- Data scientists really aren't. That is, they're not scientists in the traditional sense, says Andrew Fast, who carries the title at Elder Research Inc. While Fast figures the term was appropriated from academia at some point, what data scientists do in a business context is far from academic. It's about engineering a solution to a business problem using an iterative process that gradually homes in on the solution. The process also requires the application of business domain expertise. So what should we call him? A more descriptive title might be "applied statistician" or "data engineer," he says.
- The future of analytics is multimodal. Most people focus data mining efforts on a single data type, targeting just structured data or just textual data. But multimodal interactions that include structured, unstructured and semi-structured data types, such as spatial or temporal data (time series data such as stock history), are the wave of the future, says Fast.
- It's not just you: Data silos are still a huge barrier for many analytics projects. Many organizations can't even get one data type together from all sources, never mind thinking about a multimodal project. One attendee from a large insurance company said organizational and data silos still linger from the industry's expansion into other financial services areas long ago. At KeyBank even its analytics group was fragmented. David Bonalle, director of marketing and insight, estimates that the regional bank had between 15 and 20 analytics units buried within the various product and marketing groups, and 13 different databases. The bank has since consolidated all analytics into a single center of excellence that has full access to all data sets. And the organization reports to the CEO -- not to IT or marketing.
- People are just dying for the right diagnosis: As many as 15% of medical cases may have been misdiagnosed by physicians, and this can lead to fatal consequences, says Edward Hoffer, associate clinical professor of medicine at Harvard. Hoffer has been working to improve DXplain, a database hosted at Massachusetts General Hospital that describes some 2,500 illnesses. But physicians may not be ready to reach for such a resource. When presented with challenging cases and asked to provide a diagnosis, 72% of doctors participating in the study were confident that they had the right diagnosis -- but they were correct only 48% of the time. With harder cases the disparity widened. Doctors correctly diagnosed the problem only 7% of the time, but their confidence in their diagnosis dropped only a small amount, to 63%. "They were wrong almost all the time but were almost equally confident they were right," he says. So, rather than promote the current "pull" model, where the doctor needs to ask for help, MGH wants to try a push model. Detecting diagnostic errors is extremely difficult, so Hoffer is experimenting with an advanced text analytics project to help guide doctors. The system must automatically extract case findings, check to see if the doctor is off track, and proactively make suggestions. The challenge, he says, is that due to all of the jargon and abbreviations in medical data, "It's hard to understand what they are saying."
- How much "training data" is enough for a predictive model? Think flat. Keep adding data until the trend line flattens, says Michael Berry, analytics director for TripAdvisor. For example, Berry wanted to track when the first person from a country posted to TripAdvisor this year. The first 60 countries were represented in the first 1,000 user reviews. Then the pace slowed, with users from new countries spaced much farther apart. (Just in case you're curious, the first post this year on TripAdvisor came from the UK, while the last post from a new country, Chad, arrived after 7.6 million other reviews had been posted.)
- Big data is no big deal. "We figured out how to deal with it," and that's well-trodden ground, says Berry. It's far more important to focus on whether your data is clean, making sure your samples are unbiased, and combining and transforming variables in ways that make more information available.
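
For readers who want to try Berry's "think flat" advice on training data, here is a minimal Python sketch of the idea, not his actual method. It follows the shape of his country example: count how many distinct values have shown up as records accumulate, then report the first sample size at which a recent stretch of records added nothing new. The function names, the `window` and `max_new` parameters, and the toy country codes are all assumptions made for illustration. With a real predictive model, the curve would more likely be a validation metric than a distinct-value count.

```python
from typing import Hashable, Iterable, List, Optional


def distinct_seen_curve(items: Iterable[Hashable]) -> List[int]:
    """For each position, count the distinct values that have appeared so far."""
    seen = set()
    curve = []
    for item in items:
        seen.add(item)
        curve.append(len(seen))
    return curve


def flattening_point(curve: List[int], window: int, max_new: int = 0) -> Optional[int]:
    """Return the first sample size n at which the latest `window` records
    added at most `max_new` new distinct values, or None if the curve
    never flattens by that (hypothetical) criterion."""
    for n in range(window, len(curve) + 1):
        # Distinct count just before the current window started.
        before = curve[n - window - 1] if n > window else 0
        if curve[n - 1] - before <= max_new:
            return n
    return None


# Toy data (made up): the country of origin for ten consecutive reviews.
reviews = ["UK", "US", "UK", "FR", "US", "UK", "UK", "US", "FR", "UK"]
curve = distinct_seen_curve(reviews)
flat_at = flattening_point(curve, window=3)
```

In this toy run the curve stops rising after the fourth review, and the check flags the seventh review as the point where three records in a row brought no new country. A larger `window` makes the stopping rule more conservative, which matters when rare values, like Berry's late-arriving Chad, show up far apart.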