Why more data isn't always better

For years, the focus of data has been on amassing and storing, but more data isn't always better. Here's why.

Artist concept of the Global Precipitation Measurement (GPM) Core Observatory satellite.

Credit: NASA's Goddard Space Flight Center

In the past 10 years, the focus of data has been on amassing and storing: the more data collected, the better. But while we all became expert data gatherers, what we actually ended up with was a glut of data, a shred of the insights we expected to get, and a very expensive problem that I call Messy Plumbing Syndrome.

Data scientists, the very people who are passionate about interpreting data, are doing less analyzing and more cleaning up of messy plumbing. In fact, 80 percent of their time is spent struggling with the inefficient cleaning processes they must complete to make data usable. You can call it "data wrangling" or "data janitor work," but it's both incredibly time-consuming and a huge factor in preventing organizations from cashing in on the promise of big data.

Companies simply can’t afford to continue on this path. They need to pull themselves out of the mire of messy plumbing.

The first step is to refocus their big data lens on rich data: the hidden bounty that’s shrouded within their data warehouses. Gathering and storing data are key, but for companies to truly understand and embrace what rich data can do for them, the big data conversation has to shift completely.

Here are three reasons why:

1. Big data consolidates information. Rich data drives actual growth.

Consider the case of customer databases. Every time a customer downloads a whitepaper or signs up for a newsletter from a B2B company, their activities are being recorded in a massive CRM system somewhere. But any time customer information is added, a data mess results. Does the customer record say YouTube or Google or Google, Inc? Did the customer enter “California” or “CA” as their location? These are detailed data nuances that computers don’t resolve. Imagine how much more effective businesses could be in driving customer retention and achieving revenue goals if their customer database were full of rich data, fully deduped, complete and accurate.
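
As a rough illustration of the cleanup described above, the sketch below normalizes company and location fields and then dedupes records by email address. The alias tables, field names and sample records are hypothetical, invented for this example; a real CRM would need far larger lookup tables or fuzzy matching.

```python
# Hypothetical lookup tables for illustration; real ones would be much larger.
COMPANY_ALIASES = {
    "google": "Google",
    "google inc": "Google",
    "youtube": "Google",  # assumption: subsidiaries roll up to the parent company
}
STATE_ABBREVIATIONS = {"california": "CA", "new york": "NY"}


def clean_key(value):
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in value.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def normalize(record):
    """Return a copy of the record with canonical company and state values."""
    company_raw = record["company"].strip()
    state_raw = record["state"].strip()
    # Two-letter values are assumed to already be abbreviations.
    state_fallback = state_raw.upper() if len(state_raw) == 2 else state_raw
    return {
        "email": record["email"].strip().lower(),
        "company": COMPANY_ALIASES.get(clean_key(company_raw), company_raw),
        "state": STATE_ABBREVIATIONS.get(clean_key(state_raw), state_fallback),
    }


def dedupe(records):
    """Keep the first record seen for each normalized email address."""
    unique = {}
    for record in records:
        cleaned = normalize(record)
        unique.setdefault(cleaned["email"], cleaned)
    return list(unique.values())


records = [
    {"email": "ann@example.com", "company": "Google, Inc", "state": "California"},
    {"email": "Ann@Example.com ", "company": "YouTube", "state": "CA"},
    {"email": "bob@example.com", "company": "Acme", "state": "ny"},
]
clean_records = dedupe(records)
```

Here the two entries for the same person collapse into one record with a single spelling of the company and state. Deciding which spellings count as "the same" is the hard part, and that judgment lives in the lookup tables rather than in the code.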

2. Big data gathers the big picture. Rich data makes it meaningful.

Skybox Imaging launches low-cost satellites into orbit. They take pictures of every spot on the globe each day, images full of rich economic information. Owning massive databases with trillions of pixels capturing the entire world is one thing, but what might be even harder than launching those satellites and storing those terabytes is figuring out what’s actually in those images.

One way the data scientists are using this information involves building algorithms that detect the amount of oil in Saudi Arabian oil drums. Those estimates, in turn, can be used to predict future gas prices. Data scientists can't afford to sit there for hours upon hours marking where items (in this case, oil drums) are in images to train their algorithms. Big data is all those countless pictures full of amorphous shapes; rich data is knowing the precise number of oil drums in every image. Once they know that, the data can be analyzed to ultimately determine gas prices months in advance. Coming to that conclusion can be transformative.

3. Big data quantifies the world. Rich data changes it.

This is a pretty bold statement, but just look at the health industry. While companies can access hundreds of thousands of anonymized patient records, suppose they actually want to figure out if a new cancer treatment is effective. Big data is those thousands upon thousands of records with different codings, different date formats and doctors' notes written as free text; rich data reveals who received what treatment and who got better from it. Rich data helps change the status quo of medicine by informing ongoing research, development and innovation.
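
To make the "different date formats" problem concrete, here is a minimal sketch that tries a short list of formats and converts whatever matches to ISO dates. The format list and sample values are assumptions made up for illustration, not taken from any real patient data set.

```python
from datetime import datetime

# Assumed formats for illustration; a real data set would need its own list.
KNOWN_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %b %Y", "%B %d, %Y"]


def parse_date(text):
    """Try each known format in turn; return None if none of them match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


raw_dates = ["2014-03-07", "03/07/2014", "7 Mar 2014", "March 7, 2014", "last spring"]
iso_dates = [d.isoformat() if d else None for d in map(parse_date, raw_dates)]
```

Values that cannot be parsed come back as None so they can be reviewed by hand instead of being guessed at. Note that "03/07/2014" is read here as month first; whether that is right depends on where the record came from, which is exactly the kind of nuance that makes this work slow.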

When it comes to big data, pretty visualizations aren’t enough. An ugly visualization of rich data is far more useful than a beautiful visualization of messy and incomplete data. Companies that are serious about rich data should look to open source tools like OpenRefine, which enable data scientists to create semi-automated processes to clean, enrich and de-duplicate data sets. Tools such as MuleSoft, IFTTT and Zapier are also starting to make it easier to import large sets of disparate data sources into the same place. In other words, we’ve got the medicine we need to cure ourselves of Messy Plumbing Syndrome; we just need to use it.
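
For a sense of what semi-automated de-duplication can look like, the sketch below groups values by a normalized "fingerprint" key. It is loosely modeled on the key-collision clustering idea used by tools such as OpenRefine, but it is a simplified stand-in written for this article, not OpenRefine's actual code; the function names and sample values are hypothetical.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a key that ignores case, accents, punctuation and word order."""
    # Fold accented characters to their ASCII base letters.
    ascii_text = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    # Lowercase and strip punctuation.
    no_punct = ascii_text.lower().translate(str.maketrans("", "", string.punctuation))
    # Sort the unique tokens so that word order does not matter.
    return " ".join(sorted(set(no_punct.split())))


def cluster(values):
    """Group values that share a fingerprint; return only groups with duplicates."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]


names = ["Acme Corp.", "acme corp", "Corp Acme", "Café Noir", "Cafe Noir", "Globex"]
clusters = cluster(names)
```

The "semi" in semi-automated matters: a script like this only proposes candidate groups, and a person still confirms whether each cluster really refers to the same thing before merging.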

Our ability to gather and store data is rapidly outpacing our ability to make sense of it. Companies that choose to invest in the tools, people, and processes that turn big data into rich data are the ones that will come out ahead.

This article is published as part of the IDG Contributor Network.
