My company, CrowdFlower, recently conducted a survey of data scientists. One of the things we asked was what they spend their time working on and what they like and don’t like doing. Incredibly, cleaning and wrangling data, the thing they dislike the most, is the thing they spend the most time doing. The New York Times has reported on this phenomenon, and Michael Driscoll recently articulated the same concern in an article in Ad Age, saying, “The best minds of my generation are deleting commas from log files, and that makes me sad.”
In other words, this is a very real problem. Data scientists are some of the most in-demand employees out there, and it’s incredible in the truest sense of the word that companies are using them to do things that they a) don’t like and b) aren’t trained to do well.
By way of analogy: Say you wanted to open a world-class hospital. You decide the way to do this is to get the best surgeons around. You hire dozens of them, foot their salaries, and announce to each of them that your hospital is going to be the best on earth because you have the most impressive collection of surgeons around.
But because you hired no janitors or nurses, your surgeons are doing everything. They're answering the phone and scheduling appointments. They're organizing outpatient programs and ordering latex gloves. They're smart, sure, but they’re not trained to do the work you now have them doing. You're wasting their expertise. Your hospital is going to fail.
This is how a lot of companies handle hiring a data team. They understand that data scientists are important, so they hire a bunch of them. But what they don't realize is that data scientists, in addition to being fairly expensive and always in demand, need support to be really successful. Just as surgeons need assistants and a capable staff, data scientists need data engineers.
Smart organizations know that data improves decisions on every level, and many have been storing as much of it as they can for years. But this data is messy. It's stored in unstructured databases or third-party applications, or purchased from data sellers who store and sort their data differently. This means that while a company might have massive amounts of data, it can't really unlock the value of that data without a high-quality data team.
Storing that data was a smart choice. Trying to decipher it is a smart choice. The problem I see is that they aren't hiring the right people.
What companies need to do is hire data engineers. For starters, this is a position that's easier to fill, and you should have more data engineers than data scientists, especially if you're looking to build out a data team from the ground up. The reason is that data engineers are the ones who take the messy data I described above and build the infrastructure for real, tangible analysis. They run ETL software, marry data sets, and enrich and clean all the data that companies have been storing for years. They set the foundation.
What a data scientist does, or rather, what most data scientists want to do and what they excel at doing, is running regressions, making analyses, creating and tweaking models, spotting trends, making predictions. But what they end up doing, in many, many organizations, is cleaning data. That’s just a waste.
So while more companies are smartly embracing data-driven decision-making, they're still hiring the wrong people. Look for data engineers first and data scientists second. You'll save money, get much better analysis, and have a happier data team to show for it.
This article is published as part of the IDG Contributor Network.