After I wrote my last post, "You're hiring the wrong data scientists," I got a few questions. The one that came up the most? Exactly how do you hire a data engineer? So I thought I'd share a bit of my experience building a team of data engineers. After all, if they're the ones who build the foundation your data scientists will work on, you really need to make sure the work they put in is great.
For starters, realize there's a real difference between the interests and skill sets that data engineers and data scientists have. On the surface, they often look the same. Both can write code, both understand data. But fundamentally, good data scientists like to explore and good data engineers like to build. They both might understand the work the other does, but that doesn't mean they'll do it as well as it needs to be done.
Think of it this way: an architect understands the theory and practice of building and designing a house, but would you want one on the construction team? Likewise, would you want a foreman designing the floor plan of your house?
Let's look at a real-world problem that will probably sound familiar: How do you tell if a sales lead is any good? That's an important question that data scientists are often able to answer, using data to train an algorithm that predicts which leads will convert in your funnel. But where do those data points actually come from? Enter the data engineer.
A great data engineer can build the database that can answer this question. If your company is anything like mine, the critical data to determine if a lead is good sits in disparate places. Maybe a user downloaded a white paper your marketing team wrote. That's likely recorded in HubSpot, Marketo or something similar. That lead might create a Salesforce record, too.
But what about the other touch points along the way? What if that lead writes in to customer support? That data is probably in Zendesk, or whatever support desk tool you use. What if your application checks their NPS? That critical data goes into Wootric, or whatever you use to track customer happiness. For my company, CrowdFlower, one of the crucial ways we know a lead is good is, of course, if they're using the product. But that data is in a completely custom Postgres database. And these are the easy data sources: Once you start wanting to integrate click data from your data warehouse, you're going to have a whole new set of issues.
If you want your data scientists to build a good lead-scoring algorithm, they need to have access to all of this data. A data engineer should be responsible for building a reliable system that captures and combines every important piece of information from disparate sources.
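To make that a little more concrete, here is a minimal sketch in Python of what "combining" can look like at its simplest. Everything in it is an assumption for illustration: the sample records are hand-written stand-ins for exports from marketing, support, NPS and product systems, the field names are made up rather than taken from any vendor's schema, and joining on a lowercased email is just one possible choice of key. A real pipeline would also have to deal with authentication, scheduling, retries and messy identity matching.

```python
from collections import defaultdict

# Hypothetical sample records standing in for exports from each system.
# Field names are assumptions, not any vendor's real schema.
marketing_events = [
    {"email": "ana@example.com", "event": "whitepaper_download"},
    {"email": "bo@example.com", "event": "whitepaper_download"},
]
support_tickets = [
    {"email": "Ana@Example.com", "subject": "Upload fails"},
]
nps_responses = [
    {"email": "ana@example.com", "score": 9},
]
product_logins = [
    {"email": "ana@example.com", "logins_last_30d": 12},
    {"email": "cy@example.com", "logins_last_30d": 1},
]


def normalize_email(email):
    """Use a trimmed, lowercased email as the join key across sources."""
    return email.strip().lower()


def build_lead_profiles(marketing, tickets, nps, logins):
    """Combine per-source records into one profile dict per lead."""
    profiles = defaultdict(lambda: {
        "downloads": 0,
        "tickets": 0,
        "nps_score": None,
        "logins_last_30d": 0,
    })
    for row in marketing:
        if row["event"] == "whitepaper_download":
            profiles[normalize_email(row["email"])]["downloads"] += 1
    for row in tickets:
        profiles[normalize_email(row["email"])]["tickets"] += 1
    for row in nps:
        profiles[normalize_email(row["email"])]["nps_score"] = row["score"]
    for row in logins:
        key = normalize_email(row["email"])
        profiles[key]["logins_last_30d"] = row["logins_last_30d"]
    return dict(profiles)


profiles = build_lead_profiles(
    marketing_events, support_tickets, nps_responses, product_logins
)
```

Each lead ends up with a single record, and a lead that is missing from one source simply keeps the default for that field. That one-row-per-lead shape is what both a salesperson and a lead-scoring model would want to read from.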
So now that we've defined the role a little more, what kind of skills should you be looking for? Obviously, you'll want someone who's a cultural fit on your data team, especially since your data scientists' and data engineers' work will directly inform each other. But on a skills level, what I find works is actually looking for someone who's a bit more of a generalist backend programmer. You want someone who can integrate with diverse APIs and who understands multiple languages well enough to work in them (though, of all the languages, Python is probably the most common and important). Data engineers who are well versed in MATLAB or R, or one of the other main languages your data scientists use, are doubly valuable. If you are dealing with billions of records or more, you need someone who is familiar with distributed storage and processing tools like Hadoop or Spark.
Even if you don't have data scientists on staff, building this combined lead database would be a great exercise for your data engineers. Your sales folks will know all the times a lead has logged in, downloaded content, filed a help ticket and so on. They'll be able to speak with those leads eloquently and, more importantly, without worrying something's fallen through the cracks. Add a data scientist to the mix and that database becomes even more vital.
As for where I've found my best data engineers, that's another story. I've hired some great ones from job posts on Hacker News and AngelList, as well as through word of mouth and straight out of school. You probably have a few of your own tricks. But the important part remains: You need smart data engineers to make your data science team work. Hire intelligent engineers who love data and engineering, take time to articulate the challenges your organization faces and let them build. Just leave the exploring for after the foundation's up. Your data scientists will thank you for it.