Using live data in development means you can test real workloads and get realistic results in transactions and reports. It’s also a significant security risk, as U.K. baby retailer Kiddicare recently found out: The company used real customer names, delivery addresses, email addresses and telephone numbers on a test site, only to have the data extracted and used to send phishing text messages to customers.
In 2015, Patreon CEO Jack Conte admitted that the names, shipping addresses and email addresses of 2.3 million users of the crowdfunding site had been breached “via a debug version of our website that was visible to the public,” a “development server that included a snapshot of our production database.” And earlier this year a developer at Sydney University in Australia lost a laptop containing an unencrypted copy of a database holding the personal and medical details of 6,700 disabled students.
“We can point to incidents such as Kiddicare and Patreon to show the serious security ramifications of this,” says security expert Troy Hunt, who runs the Have I Been Pwned? site to help consumers find out if any of their accounts have been compromised. “There are industry precedents for just how bad this can go.”
“Then you consider the logistics of getting production data to test: Someone is connecting to both environments, perhaps that's a linked SQL Server in the test environment with access to the production data. I've seen this done before and it's a huge risk,” says Hunt. “The excuse you'll often hear from developers is ‘I need to reproduce a bug that only happens in production,’ but that points to a lack of error handling and logging on their part.”
Being able to simulate or virtualize data is not only safer, says Hunt, but it can be a productivity boost. “It's not just the security and code quality issues; generating test data in an automated fashion enables you to easily recreate the same environment for others on the team. In an ideal world, you simply fire up the data generation script and provision yourself a fully populated, non-production environment. Yes, it may be more work than a one-off copy of production but you only need to do this once and you're not faced with dealing with customer data outside production.”
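What such a data generation script might look like is straightforward enough. As a rough, hypothetical sketch (not from Hunt or any particular product), a few lines of Python using the open-source Faker library can populate a throwaway SQLite database with entirely synthetic customer records; the table and column names here are invented for illustration.

```python
# Illustrative sketch only: automated synthetic test data, so no customer
# records ever leave production. Table and column names are hypothetical.
import sqlite3
from faker import Faker

fake = Faker()

def provision_test_db(path="devtest.db", rows=10_000):
    """Create a non-production database populated with synthetic customers."""
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS customers (
                        id INTEGER PRIMARY KEY,
                        name TEXT, email TEXT, address TEXT, phone TEXT)""")
    conn.executemany(
        "INSERT INTO customers (name, email, address, phone) VALUES (?, ?, ?, ?)",
        ((fake.name(), fake.email(), fake.address(), fake.phone_number())
         for _ in range(rows)),
    )
    conn.commit()
    conn.close()

if __name__ == "__main__":
    provision_test_db()   # rerun whenever anyone on the team needs a fresh environment
```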
Security and agility
That combination of data security and data agility is key if you’re going to shift your IT budget from maintaining your existing systems to innovation, says Daniel Graves, vice president of product management at data virtualization vendor Delphix. The Delphix software delivers virtual copies of data from databases like SQL Server, IBM DB2, Oracle Database and E-Business Suite, and soon MongoDB, without waiting for exports to run or going through a complicated runbook of manual processes to remediate the system.
“IT leaders want to go from quarterly releases of apps to monthly,” Graves says. “Websites want to move from daily to hourly releases. Our banking customers are going from releasing an update to their software once a year to every few weeks. This drive to increase speed and deliver more features more rapidly is coming from every industry — and not the ones you’d normally expect. Governments are doing this, healthcare organizations are doing this.”
Adopting continuous delivery and updating your customer-facing apps and services frequently doesn’t help speed up development if developers are waiting for access to the data those applications work with. “Managing tremendous amounts of data has been a key blocker to them,” Graves claims. “With DevOps tools they can automate their infrastructure and spin VMs up and down more quickly, but they can’t do that with data. If you want to spin up a dozen copies of your data, it takes weeks. Extracting data, moving it across the network and making physical copies is a slow, laborious, manual process. Delphix can take their data environment and allow them to manipulate it in minutes, not weeks, in a self-service fashion.”
Faster access to data can improve productivity and even software quality. Developers can run far more tests because it’s so much quicker to get a clean copy of the data for every test run. “Getting a test system ready to run can take a day; if you run one set of regression tests that take an hour then you have to reset the environment, and that takes another 16 or 18 hours,” Graves says. If it takes a day to run an hour’s worth of tests, you can only do seven tests a week. “If you can reset in minutes, now you can run 24 tests a day, and that means you’re finding bugs much earlier in your development cycle, which reduces the cost and complexity of fixing them.”
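The arithmetic behind those figures is simple enough to restate (using the article’s example numbers, not measurements):

```python
# Back-of-the-envelope restatement of Graves' example figures (illustrative only):
# reset time, not test time, dominates throughput.
HOURS_PER_WEEK = 7 * 24

def runs_per_week(test_hours, reset_hours):
    return HOURS_PER_WEEK / (test_hours + reset_hours)   # full test cycles per week

print(runs_per_week(test_hours=1, reset_hours=23))   # ~7 runs a week with a day-long cycle
print(runs_per_week(test_hours=1, reset_hours=0))    # 168 a week, i.e. the 24 a day Graves cites
```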
If you have multiple developers and QA teams needing to use the same data and needing a clean copy of it each time, you can give each of them their own sandbox. Online ticket marketplace StubHub, a Delphix customer, used to have seven copies of its data: three for developers, three for QA and one for beta testing. Now it has over a hundred copies. “That’s something you would never do in the physical world,” Graves notes. “When your database is dozens of terabytes in size, you are never going to buy enough storage or employ enough DBAs to manage 150 copies of your data. Once it’s virtual, you have a lightweight, instant and secure way to proliferate them without increasing the security risk.”
To do that, Delphix can also mask and tokenize data for security during development. Healthcare, for example, is subject to complex regulations protecting patient data and personal information. “You have to be very careful about using that information in your development process,” Graves says, and the same is true if you’re dealing with payment information. “In order to drive a new system to market they need to follow all these rules and regulations carefully. We can identify sensitive data by profiling the source database to find the names and addresses and other personally identifiable information, and then we use masking algorithms to create realistic looking but totally desensitized versions of those to protect the integrity of the application.”
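Graves doesn’t detail Delphix’s algorithms, but a common pattern for this kind of masking is to derive the fake value deterministically from a keyed hash of the real one, so the mapping stays consistent across tables (preserving joins) while remaining one-way. A hypothetical sketch, not Delphix’s implementation:

```python
# Illustrative sketch only -- not Delphix's masking implementation.
# The same input always masks to the same realistic-looking output,
# but the transformation cannot be reversed.
import hashlib
import hmac
from faker import Faker

SECRET_KEY = b"rotate-me"   # hypothetical per-environment masking key

def mask_name(real_name: str) -> str:
    digest = hmac.new(SECRET_KEY, real_name.encode(), hashlib.sha256).digest()
    seed = int.from_bytes(digest[:8], "big")
    fake = Faker()
    fake.seed_instance(seed)        # deterministic: same input -> same fake output
    return fake.name()

print(mask_name("Dan Graves"))      # a realistic but unrelated name, every time
```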
Masking isn’t new (SQL Server 2016, for example, will let you set policy to automatically mask chosen fields in database reports and exports by role), but combining it with data virtualization that covers all your data sources is, says Graves. “The result is, say I’m in the QA team; now I’ve got self-service controls but the admin can set it up so that when I work on this application I always get the last month of the most recent data and it’s always masked. I don’t even have a choice.”
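For reference, the SQL Server 2016 capability Graves is comparing against is dynamic data masking, which is configured per column and bypassed only for logins granted UNMASK. A rough sketch, driven here from Python via pyodbc, with a hypothetical connection string and hypothetical table, column and role names:

```python
# Sketch of SQL Server 2016 dynamic data masking, applied per column.
# DSN, table, column and role names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=devtest;UID=app;PWD=...")   # hypothetical connection
cur = conn.cursor()

# Mask email and phone values for non-privileged users.
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN Email "
            "ADD MASKED WITH (FUNCTION = 'email()')")
cur.execute("ALTER TABLE dbo.Customers ALTER COLUMN Phone "
            "ADD MASKED WITH (FUNCTION = 'default()')")

# Roles that genuinely need the raw values can be granted UNMASK.
cur.execute("GRANT UNMASK TO reporting_role")
conn.commit()
```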
That matters because developers don’t always follow company policy on protecting data. “Data is being stolen out of QA and devtest,” Graves says. “Companies have done a great job of protecting data locations but that data is also on someone’s laptop in the coffee shop and attackers are going to go after the least defended part of your system.”
Securing offsite and hybrid cloud
Virtualizing data isn’t just for developers: It can help everyone from business analysts to the IT team, especially if you’re considering hybrid cloud.
“You can sync data into Delphix, mask the data then relocate it into cloud services like AWS so you can do devtest in the cloud and reporting in house,” says Graves. “It’s the same for analytics. You can sync a fully realistic set of data — and masking is an irreversible process. If I change Dan Graves into Steve Johnson, you can’t get it back so if it gets stolen it doesn’t matter. That allows you to move a significant amount of your workload into a cloud environment to reduce costs and enable bursting but without any change in the security, governance and control because of the masking.”
There are reversible options like tokenization, if you want to use data virtualization for disaster recovery and make sure you can get the original data back. “It’s about getting the right data in the right form to the right user, when they need it,” says Graves.
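Tokenization differs from masking in exactly that reversibility. As a simplified illustration (not any particular vendor’s implementation), a token vault keeps a protected mapping from token back to original value, so authorized systems can recover the real data later:

```python
# Illustrative contrast with irreversible masking: tokenization keeps a secured
# mapping ("vault") from token back to the original value. Names and in-memory
# storage here are hypothetical; a real vault would be encrypted and access-controlled.
import secrets

class TokenVault:
    def __init__(self):
        self._token_to_value = {}

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        return token                          # safe to hand to dev/test or analytics

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]    # reversible, unlike masking

vault = TokenVault()
t = vault.tokenize("Dan Graves")
print(t)                      # e.g. tok_1a2b3c4d5e6f7a8b
print(vault.detokenize(t))    # original value recoverable by authorized systems
```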
There are a lot of advantages to data virtualization and masking, and plenty of incentives to start adopting them. Losing unencrypted test data, for example, is the kind of poorly managed process that could incur fines under the new European General Data Protection Regulation (GDPR). But if you’re using masking or other forms of pseudonymization, you may not have to report breaches or respond to data access and data removal requests, or obtain consent for automated decision making and profiling.
"The GDPR introduces a carrot and stick approach to promoting data masking,” notes Phil Lee from the Privacy, Security and Information team at international law firm Fieldfisher. “It encourages businesses to adopt pseudonymization technologies, either as part of good information management or by reducing regulatory burdens in the event of unforeseen events, like security incidents. Contrasted against that, companies that are not in compliance with the GDPR face regulators waving a very big stick — potential fines of up to four percent of annual worldwide turnover.”
This story, "How data virtualization delivers on the DevOps promise" was originally published by CIO.