Are datasets truly anonymized? Two well-matched researchers aim to find out


The researchers hope to develop privacy safeguards for the very large datasets used in research. One of them is Vitaly Shmatikov -- yes, the same Vitaly Shmatikov who was part of the team that successfully de-anonymized the customer data Netflix provided in a 2006 contest promotion.

With little fanfare or formality, Adam Smith, associate professor of computer science and engineering in Penn State's School of Electrical Engineering and Computer Science, and Vitaly Shmatikov, a professor at Cornell University, are going to try to tackle a looming issue. If it is not addressed, it will have consequences for just about anyone who has ever used the Internet, sent an email, received medical attention or otherwise made his or her presence known on the grid that is our online society.

That issue is the leak of information -- or inputs, as they are called in the researchers' world -- from deep learning systems, leaks that could be used to identify supposedly anonymous subjects by name.

Here's the kicker: Smith and Shmatikov are moving forward with this research with a grant from Google, the company that helped to put deep learning on the computing map with its research in the first place. In other circles, Google is also known as the company that has played a not-so-small role in making online privacy, or lack thereof, a growing concern.

Google bestowed the grant -- the amount of which was not reported -- under its Faculty Research Awards program, which gives one-year awards structured as unrestricted gifts to universities to support research in a range of subjects that might benefit from collaboration with Google, according to Penn State.

Of course, deep learning, as Smith himself said, is already here, used to recognize speech and images and perform such feats as steering cars through traffic without a human at the helm.

And it has been proven that privacy is easily compromised in the presence of such sophisticated computing (more on that in a minute).

But the rapid growth of very large datasets, increasingly tapped by researchers and others for various projects, makes Smith's and Shmatikov's research project particularly timely. These datasets cover just about every topic, from mental health service users in the UK to information collected by the National Aeronautics and Space Administration.

Even if you have been living off the grid, you are probably in one of them, or will be in the future.

Deep learning and what it can do

The datasets are so huge that it would seem that pulling out one identifiable person is nearly impossible. But that is where deep learning comes in.

Briefly, deep learning refers to a wide class of machine learning techniques and architectures. The end goal is to have computers recognize objects or other items of interest via a trial-and-error process of learning to detect patterns or otherwise come to certain conclusions. Machine learning has been improving significantly in recent years in part because of the existence of these massive datasets -- the more data the systems learn from, the more accurate they become.
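To make that idea concrete, here is a minimal, purely illustrative sketch -- a single perceptron in plain Python, not a real deep learning system. The toy dataset and all names are invented for illustration, but the mechanics are the same trial-and-error loop: guess, check, and nudge the model after every mistake, with accuracy tending to improve as more data is seen.

```python
import random

random.seed(0)

def make_data(n):
    # Toy "dataset": random points labeled by which side of a line they fall on.
    pts = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(n)]
    return [((x, y), 1 if x + y > 0 else -1) for x, y in pts]

def train(data, epochs=20):
    # Trial-and-error learning: adjust the weights after every wrong guess.
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x, y), label in data:
            pred = 1 if w[0] * x + w[1] * y + b > 0 else -1
            if pred != label:          # wrong guess -> nudge toward the label
                w[0] += label * x
                w[1] += label * y
                b += label
    return w, b

def accuracy(model, data):
    w, b = model
    hits = sum(1 for (x, y), label in data
               if (1 if w[0] * x + w[1] * y + b > 0 else -1) == label)
    return hits / len(data)

test_set = make_data(1000)
small = train(make_data(10))     # model trained on little data
large = train(make_data(500))    # model trained on much more data
print(accuracy(small, test_set), accuracy(large, test_set))
```

Real deep learning stacks many such learned layers and trains them on millions of examples, which is exactly why it needs the massive datasets the researchers are worried about.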

Vendors such as Nvidia, Microsoft and Amazon Web Services -- to name just three examples -- have been releasing products in this category. Few, though, have focused on protecting the privacy of the "inputs."

Privacy safeguards that are easily circumvented

To be sure, researchers and the entities from which they receive these datasets are adamant in their promises that the data has been stripped of identifying features before it is handed over to researchers.

Which is all well and good. But now, let's take a short walk down memory lane.

In 2006, Netflix wanted to improve its movie recommendation system -- and perhaps get a bit of viral marketing going -- by launching a contest.

It challenged researchers around the world to build a recommendation system that could beat its existing one by at least 10 percent. The prize would be $1 million to the team that could exceed that benchmark by the widest margin.

The participants were given access to a training data set that consisted of more than 100 million ratings from over 480,000 randomly chosen, anonymous customers.

Forty thousand research teams entered Netflix's contest, including a team from the University of Texas -- and if you don't remember how this particular story line ends, I am sure you can guess.

Arvind Narayanan and Vitaly Shmatikov, researchers at the University of Texas at Austin -- yes, that would be the same Shmatikov who was co-awarded the Google grant -- were able to de-anonymize some of the Netflix data, according to a blog post by Bruce Schneier back in the day. They went on to publish their findings, much to the chagrin of Netflix, which had promised -- promised! -- users that their personal information had been protected. There was also the not-so-small matter of the Video Privacy Protection Act.

Felix T. Wu, an associate professor at the Benjamin N. Cardozo School of Law, included a retrospective on the event in a far larger essay on online privacy that was published in the school's Law Review.

He wrote that:

The Texas researchers showed that despite the modifications made to the released data, a relatively small amount of information about an individual's movie rentals and preferences was enough to single out that person's complete record in the data set. In other words, someone who knew a little about a particular person's movie watching habits, such as might be revealed in an informal gathering or at the office, could use that information to determine the rest of that person's movie watching history, perhaps including movies that the person did not want others to know that he or she watched.

Narayanan and Shmatikov also showed that sometimes the necessary initial information could be gleaned from publicly available sources, such as ratings on the Internet Movie Database.
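The attack Wu describes is, at its core, a linkage attack: take a few externally known facts about a person, score every "anonymized" record against them, and keep the best fit. A toy sketch in Python, with entirely hypothetical data (the real attack used statistical scoring robust to noisy and approximate matches, not this exact-match version):

```python
# Hypothetical "anonymized" release: opaque IDs mapped to full rating histories.
anonymized = {
    "user_001": {"Heat": 5, "Alien": 4, "Clue": 2, "Big": 3, "Jaws": 5},
    "user_002": {"Heat": 3, "Fargo": 5, "Clue": 4, "Up": 1},
    "user_003": {"Alien": 2, "Big": 5, "Jaws": 1, "Up": 4},
}

# Auxiliary knowledge: a couple of ratings overheard at the office.
known = {"Heat": 5, "Clue": 2}

def best_match(records, aux):
    # Score each record by how many known (movie, rating) pairs it matches,
    # then return the ID of the highest-scoring record.
    def score(rec):
        return sum(1 for movie, rating in aux.items() if rec.get(movie) == rating)
    return max(records, key=lambda uid: score(records[uid]))

uid = best_match(anonymized, known)
print(uid, anonymized[uid])  # the person's complete history falls out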

From a distance the episode is almost comic, although at the time it definitely was not. Netflix was subject to a class action suit brought by a gay woman who had not publicly disclosed her sexual orientation and was outed by the research. That suit was eventually settled.

A collaborative learning approach

With this new Google-funded research project Smith and Shmatikov hope to prevent such mishaps from happening again, Shmatikov tells me. "Adam and I plan to develop new approaches to deep learning that will protect privacy of individual users' data, while still enabling all the amazing services that deep learning has made possible," he says.

"Among the techniques we plan to explore are collaborative learning, where -- instead of pooling the training data like they do now -- participants keep their data private and train independently but exchange a little bit of information during their respective training, and differential privacy, which is a mathematical model of privacy that was co-invented by Adam."

Smith, he adds, "is a top-notch theoretician who made seminal contributions to the science of data privacy, while my research focuses on understanding real-world privacy risks and building practical privacy-preserving systems."

"This is a good combination of skills that we hope will lead to substantial progress in privacy-preserving machine learning."

Playing with fire

Yes, indeed. Hopefully they will produce a working system relatively soon, as the temptation for users of these datasets to give privacy concerns short shrift is very great.

After its first debacle with the contest, Netflix actually decided to launch a second contest.

After an inquiry from the Federal Trade Commission, it changed its mind.

This article is published as part of the IDG Contributor Network. Want to Join?
