Are you sharing more data with Google than you have to?

A new approach to limiting how much of your data you need to share is being offered to companies for free


Whether your concerns are privacy, security, competitive advantage, intellectual property or risk avoidance, your enterprise needs to be sharing—literally—as little data as possible with employees, contractors and third parties. As obvious as that statement is, it’s stunning how much data is unnecessarily shared with cloud providers and others.

There are two reasons for this. First, the time and effort needed to separate the data that the third party truly needs from the data it doesn't can make the ROI seem unattractive. This is especially true when executives play down the risk of anything bad happening.

As in "I’m probably safe trusting Google/Microsoft/Amazon/Rackspace, etc." Really? Even if you choose to assume that their security is stellar—it isn’t—what about competitive issues? Are you really willing to trust that they will handle your data with your best interests at heart?

The second reason is more practical: technological limitations. The way many enterprises handle data—especially data that is either created by or managed by mobile devices—makes it truly difficult to easily separate the critical from the non-essential.

Limited data sharing and encryption

Researchers at the Swiss Federal Institute of Technology in Lausanne (officially the École Polytechnique Fédérale de Lausanne, or EPFL) may have come up with a way to deal with both issues. Their approach limits what data is shared and uses a form of encryption that allows data to be crunched while still encrypted.

The approach they are proposing is designed to deal with a narrow problem: privacy and security in ride-sharing services such as Uber and Lyft. But its creators see the same approach applying to a wide range of cloud, big data and other third-party services that enterprises deal with every day, where they typically share far more information than they need or want to.

Italo Dacosta, an EPFL postdoctoral researcher involved in the project, cited hospitals that "in the context of personalized medicine, want to do computations on the DNA sequence" and seek a cloud firm to help with the complex number-crunching. "Patients may not be comfortable sharing the DNA sequence because it’s so sensitive," he said in a Skype interview with Computerworld.

"[With] homomorphic encryption, patients will not have to reveal their DNA sequence at all, not even partially," Dacosta said. "The main use case for homomorphic encryption for personalized medicine is allowing researchers/doctors from other hospitals/medical institutions to analyze genomic data without having to reveal the data to them. They only see the results of their queries and analysis."

The third parties "never see the real data, but you get the results from the computations. [Third parties] don’t need to see the data [as they] can crunch the data while it’s encrypted."

The researchers are publishing their source code and full implementation details in the hope that companies will adopt the approach. They deliberately have avoided patenting the approach, preferring companies to use it for free, Dacosta said.

Somewhat-Homomorphic Encryption (SHE)

The approach, detailed in this paper, involves Somewhat-Homomorphic Encryption (SHE). (Note: Stanford University has published a short description of SHE.)

This excerpt from that paper gives the overview of the technical approach:

"SHE cryptosystems present semantic security, i.e., it is not (computationally) possible to know if two different encryptions conceal the same plaintext. Therefore, it is possible for a party without the private key to operate on the ciphertexts produced by riders and drivers, without obtaining any information about the plaintext values. Additionally, we choose one of the most recent and efficient SHE schemes based on ideal lattices, the FV scheme. This scheme relies on the hardness of the Ring Learning with Errors (RLWE) problem. Note that whenever working with cryptosystems based on finite rings, we usually work with integer numbers, hence, from here on, we will assume that all inputs are adequately quantized as integers.

"When a rider wants to make a ride request, she generates an ephemeral FV public/private key-pair together with a relinearization key. She uses the public key to encrypt her planar coordinates and obtains their encrypted forms. She then informs the [service provider] about the zone of her pick-up location, the public and relinearization keys and her encrypted planar coordinates. When this information arrives at the [service provider], the [service provider] broadcasts the public key to all drivers available in that zone. Each driver uses the public key to encrypt their planar coordinates and sends them to the SP. The SP computes, based on their encrypted coordinates, the encrypted distances between the rider and the drivers, and it returns the encrypted distances to the rider, from which the rider can decrypt and select the best match, e.g., the driver who is the closest to her pick-up location."
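To make the flow in that excerpt concrete, here is a minimal Python sketch of the same sequence of steps: the rider creates a one-time key pair, rider and drivers encrypt their coordinates under the rider's public key, the service provider combines ciphertexts, and only the rider can decrypt the result. It is an illustration under loud assumptions, not the researchers' code. It does not use the FV scheme; it swaps in a much simpler toy scheme over plain integers (in the style of the "fully homomorphic encryption over the integers" construction), with parameters far too small to be secure. It has no relinearization key, it computes squared distances rather than distances, and every name in it (keygen, encrypt, match and so on) is hypothetical.

```python
import secrets

# Toy parameters chosen for readability. They are NOT secure.
T = 2**32          # plaintext modulus: every result must stay below this
NOISE_BITS = 16    # size of each fresh random noise term
KEY_BITS = 128     # size of the secret integer p
PUB_ELEMS = 8      # number of "encryptions of zero" in the public key


def keygen():
    """Rider: create a one-time secret p and a public list of encryptions of zero."""
    p = secrets.randbits(KEY_BITS) | (1 << (KEY_BITS - 1)) | 1  # large odd integer
    public = [
        p * secrets.randbits(2 * KEY_BITS) + T * secrets.randbits(NOISE_BITS)
        for _ in range(PUB_ELEMS)
    ]
    return p, public


def encrypt(public, m):
    """Anyone holding the public key: hide m under noise plus a random subset sum."""
    subset = [x for x in public if secrets.randbits(1)]
    return m + T * secrets.randbits(NOISE_BITS) + sum(subset)


def decrypt(p, c):
    """Rider: remove the multiple of p, centre the residue, then reduce mod T."""
    v = c % p
    if v > p // 2:
        v -= p
    return v % T


def encrypted_squared_distance(rider_ct, driver_ct):
    """Service provider: arithmetic on ciphertexts only, no key, no coordinates."""
    dx = rider_ct[0] - driver_ct[0]
    dy = rider_ct[1] - driver_ct[1]
    return dx * dx + dy * dy


def match(rider_xy, drivers_xy):
    """Run one ride request; return (index of closest driver, squared distances)."""
    secret, public = keygen()                                   # rider
    rider_ct = [encrypt(public, v) for v in rider_xy]           # rider -> provider
    driver_cts = [[encrypt(public, v) for v in d] for d in drivers_xy]  # drivers
    enc = [encrypted_squared_distance(rider_ct, d) for d in driver_cts]  # provider
    dists = [decrypt(secret, c) for c in enc]                   # rider
    return dists.index(min(dists)), dists


if __name__ == "__main__":
    # Integer planar coordinates, as the excerpt assumes quantized inputs.
    print(match((100, 200), [(400, 600), (130, 240), (900, 50)]))
```

The "somewhat" in the name shows up in the noise: each subtraction and multiplication makes the hidden noise larger, and decryption only works while that noise stays below half of the secret integer. The toy parameters above leave room for exactly the one squaring this computation needs, as long as the coordinates and the resulting squared distance stay well below the plaintext modulus.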

This approach was crafted with a mobile network in mind, although there is nothing about the SHE implementation that wouldn't work in a non-mobile environment. But the paper did acknowledge what IT has known for years, which is that mobile devices are impressively leaky from a data perspective.

The researchers tried to sidestep mobile data-leaking problems.

"We assume that the metadata of the network and lower communication layers cannot be used to identify riders and drivers or to link their activities. Such an assumption is reasonable because, in most cases, the smartphones of drivers and riders do not have fixed public IP addresses [since] they access the Internet via a NAT gateway offered by their cellular provider. If needed, a VPN proxy or Tor could be used to hide network identifiers," the paper said. "Moreover, drivers use a navigation app that does not leak their locations to the [service provider]. This can be done by using a third party navigation/traffic app—e.g., Google Maps, TomTom, Garmin—or pre-fetching the map of their operating areas—e.g., a city—and using the navigation app in off-line mode."

Some drawbacks to the system

Still, even for its intended ride-hailing use, the system has its drawbacks, the paper said.

"The evaluation of [the service] by using real data-sets from NYC taxi cabs shows that, even with strong bitsecurity of more than 112 bits, ORide introduces acceptable computational and bandwidth costs for riders, drivers and the [service provider]. For example, for each ride request, a rider needs to download only one ciphertext of size 186 KB with a computational overhead of less than ten milliseconds. ORide also provides large anonymity sets for riders at the cost of acceptable bandwidth requirements for the drivers: e.g., for rides in the boroughs of Queens and Bronx, a ride would have an anonymity set of about 26,000, and the drivers are only required to have a data-connection speed of less than 2 Mbps. Moreover, our results show that ORide is scalable, as we considered a request load that is significantly higher than the one in current RHSs, e.g., Uber accounts for only 15% of the ride pick-up requests in NYC," the researchers wrote.

But "PrivateRide’s usability is reduced [compared with] current [car services] because the supported payment mechanism is less convenient. [Their approach] requires payments with e-cash bought in advance before a ride. Moreover, ride-matching is suboptimal, because the distance between rider and drivers is estimated using the centers of the cloaked areas, instead of exact locations, resulting in additional waiting time for riders."

Those drawbacks, though, seem limited to a car-sharing service. They likely wouldn't have much of an impact on typical outsourced enterprise big data efforts.

I recently talked with a senior executive at a very large cloud hosting company who described how a government agency recently asked for help with a very large data analytics project. How large? The executive originally estimated that they would need 100 servers to run the analytics and they ended up using almost 2,000 servers. Yes, sometimes big data gets very big.

That's the point. Any time you outsource data, you are taking a massive risk. Will the data be well-protected? By the way, who actually gets access? You need to trust not merely the employees of that third party, but also any of the third party's contractors that have access. Is someone sanitizing backups? Heck, is this third party's data being backed up by yet another third party?

How far down that rabbit hole do you want your data to go? Want to get a call one day from a Secret Service agent informing you that your data was found in the files of a company you've never heard of? It might be an unauthorized access, but the odds are decent that it could be an authorized one. By outsourcing your data, you are also outsourcing control. How trusting are you?

This Swiss approach won't solve that problem. But if it provides a way to reduce your risk (and did I say it's being offered to companies for free?), it might be well worth exploring.
