We need open data to become the new open source

Data is becoming more and more critical to businesses, but almost all data is siloed inside corporations. The lack of open data sets today holds innovation back and that needs to change.

The open source movement is one of the most powerful forces pushing technology forward. It’s easy to forget that not long ago, startups had to raise tons of money from VCs to license Oracle or a web server. Today any tiny startup has access to the best tools in the world. 

But the lack of open data still seriously holds innovation back, and as data becomes more critical, the problem becomes worse.

For example, think about how hard it is for innovative predictive analytics companies to get off the ground. It's not that they don't have the software; it's that they don't have the data. There are plenty of excellent open source projects to build on top of (SciPy, R, etc.). But the lack of usable data is a huge issue when it comes to testing and training the algorithms in any domain.

The same is true when an entrepreneur starts an e-commerce company. A high-quality search engine is crucial in e-commerce, and there are plenty of great tools for building the search infrastructure, such as Lucene, but no good datasets for testing and training the ranking and relevance algorithms.

Which is to say this: There are smart, creative data scientists out there who don’t have the data they need to do valuable work. Remember when Netflix ran a contest to beat its own movie rating algorithm? Tens of thousands of solutions were submitted, all based on a single data set of about 100,000,000 rows. Netflix eventually awarded the million-dollar prize to a team of data scientists who beat its algorithm by more than 10%.

Even now, a half decade since the prize was awarded, that Netflix data set is constantly used in computer science research -- over 3,000 papers mention it. And almost all of the papers that mention it were written after the contest ended. It’s not that movie data is so important to computer science research -- there just aren’t many good quality datasets available. The contest wasn’t the important thing -- releasing the data was the real value to the world.

[Figure: Graph of papers mentioning the Netflix Prize. Credit: Lukas Biewald]


Personally, I know I would’ve appreciated having access to the Netflix data in college or at my first startup. There simply weren’t many datasets out there to do real-world work with. In fact, our research was often based on the datasets that happened to be available, and they were often toy-sized and only marginally relevant to the real world.

And the reality is, it’s still difficult for students and academics to do any research on big data, even now, since most of those big datasets are locked up inside company data warehouses.

More open data -- especially large, robust datasets -- solves all these problems. It helps startups train algorithms and iterate quickly. It helps academics get to the bottom of important issues like cyberbullying or disaster response times. It makes creating truly great, data-driven software much, much easier.

It’s exciting to see our government start to take releasing data seriously with projects like data.gov. Amazon hosts a number of interesting public datasets for free. Universities like UC Irvine have released valuable datasets in their Machine Learning Repository. And startups like Enigma.io have appeared to help companies make use of public data.

What we really need, though, is for companies to start sharing commercial datasets, the way they already share open source software projects. Once a few companies start to do this and see the benefits, it will inspire other companies and kick off a massive increase in the rate of innovation.


Copyright © 2015 IDG Communications, Inc.
