If you look at a data science job posting, chances are it will ask for experience with machine-learning techniques, statistical programming languages, NoSQL databases and maybe some visualization tools. If you look at the curriculum of the new data scientist “boot camps,” the material is the same. But if you look at what a data scientist actually does, it’s collecting and cleaning data. This isn’t because data scientists are misguided. Quite the opposite, in fact. It’s because they know that collecting and cleaning data is actually the most important ingredient of a successful data application.
One great example is Google Translate. Nowadays, Google employs some of the best scientists studying natural language translation. But when Google Translate launched, it was immediately as good as or better than many translation products that had been developed over decades. And that wasn’t due to state-of-the-art algorithms. Rather, it was because Google had a corpus of data bigger than anyone else had access to: the entire Web, which it had already crawled to build Google Search. While Google hires the best data scientists and talks a lot about its amazing algorithms, if you ask the researchers who work there, they will tell you that much of their success comes from a massive brute-force effort to make high-quality data available everywhere.
When I worked on a translation system at the Stanford A.I. lab, we usually trained our models on the biggest corpus we could find: the European Parliament Proceedings, because those sessions are conveniently hand-translated into every official language of the EU. I worked hard to make my algorithm handle words with multiple meanings, but it always translated “cabinet” as though it were the political body rather than the piece of furniture. No matter how sophisticated my algorithm, it had no chance of discovering that “cabinet” could mean furniture, because in the data it was trained on, “cabinet” was almost always political.
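To make the failure mode concrete, here is a minimal sketch in Python. It is not the actual Stanford system, and the counts are made up to mirror a parliamentary corpus; it just shows how a maximum-likelihood choice with no context will always pick whichever sense dominated the training data.

```python
# Hypothetical sketch: a frequency-based sense model picks whichever
# translation of "cabinet" it saw most often during training.
from collections import Counter

# Made-up counts reflecting a parliamentary corpus, where the
# political sense of "cabinet" overwhelms the furniture sense.
sense_counts = Counter({
    "government_cabinet": 9_842,  # political body (dominant in Europarl-style text)
    "furniture_cabinet": 3,       # piece of furniture (almost absent)
})

def translate_cabinet(counts: Counter) -> str:
    """Maximum-likelihood choice: return the most frequent sense seen in training."""
    sense, _ = counts.most_common(1)[0]
    return sense

print(translate_cabinet(sense_counts))  # always "government_cabinet"
```

No amount of algorithmic cleverness changes that output; only different data does.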
Google didn’t have this problem. It had millions of websites hand-translated into various languages, and with some smart cleanup, its model beat out millions of person-years of research.
Here’s a typical graph from an academic paper titled “Active Semi-Supervised Learning for Improving Word Alignment.” It plots the error rate of several different training techniques for “word alignment,” and, ostensibly, the paper shows that the fancier algorithm, in this case “Posterior,” outperforms the more basic technique, “Random.” And indeed, the better algorithms did outperform the simpler one by enough to get the paper published. But the dominant effect, which isn’t even mentioned, is that as training data is added, all of the algorithms get much, much better. You can use fancier algorithms and you’ll be a little more efficient, but if you want to be sure that your machine learning will work well in the real world, the best thing you can do is give it lots of training data.
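You can see the same pattern with any off-the-shelf learner. The sketch below is a generic illustration, not the paper’s word-alignment experiment: it uses a synthetic dataset and compares learning curves for a simple and a fancier scikit-learn classifier. Both keep improving as they see more training examples, and that improvement usually dwarfs the gap between the two models.

```python
# Illustrative learning curves on synthetic data (not the paper's setup):
# more training examples helps every model, fancy or not.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import learning_curve

# Synthetic classification problem standing in for "real" data.
X, y = make_classification(n_samples=10_000, n_features=40,
                           n_informative=10, random_state=0)

models = [
    ("simple (logistic regression)", LogisticRegression(max_iter=1000)),
    ("fancier (gradient boosting)", GradientBoostingClassifier()),
]

for name, model in models:
    # Evaluate each model on 5%, ..., 100% of the training split.
    sizes, _, test_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.05, 1.0, 5),
        cv=3, scoring="accuracy", n_jobs=-1)
    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(f"{name}: {n:6d} examples -> {score:.3f} accuracy")
```

The exact numbers don’t matter; what matters is the shape of the curves, which climb for every model as the training set grows.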
Why don’t we talk more about this? It’s not the sexiest idea, but it really, really matters in the real world. Researchers often don’t feel empowered to collect their own large-scale data sets, but if you want your researchers to be successful, you should empower them!
This is also why I believe so strongly in open data. As machine learning becomes an integral part of every business, the companies with the most data are going to have a massive and unfair advantage. If we can create open datasets that are as widespread as open-source software, we can give smaller upstarts the chance to compete. But in the meantime, if you are a data scientist or work with data scientists, there’s a simple thing you can do to get that advantage: collect more data.