Everyone I spoke to at Informatica World 2015 last month wants to build a Data Lake but is afraid that the Lake will end up as a Swamp. Based on where we are heading, I don’t think using words like Lakes, Swamps, Reservoirs, or names of other large data bodies is going to work well. It helps with visualizing the solution, but upon further discussion it ends up creating more confusion.
So, let’s take a fresh look at things.
It’s a Goodwill store
At a Goodwill store, you know that the stuff on display is gently used. In the store, you will find clothes, art, books, musical instruments, electronics, etc. Similarly, data that has delivered its value to its source system doesn’t need to be exclusive anymore.
In other words, it’s time for the data to go work for the community.
Expect the best, plan for the worst
Not everything you donate will be deemed fit for sale. So, apply some basic rules to your data. What are the rules? For this, you need metadata. Each entry in the Lake must be able to answer the following questions: why, who, what, when, where, and how?
Now, don’t start building logic to create metadata in the Lake; it’s unnecessary. The upstream system that supplies the data incorporates this information during delivery.
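As a rough illustration of such a basic rule, here is a minimal sketch of an intake check that accepts an entry only if the supplier has answered all six questions. The `EntryMetadata` class and the function names are hypothetical, invented for this example; the Lake only reads what the upstream system delivered and creates nothing itself.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class EntryMetadata:
    """Hypothetical metadata record: one field per question, filled in upstream."""
    why: Optional[str] = None
    who: Optional[str] = None
    what: Optional[str] = None
    when: Optional[str] = None
    where: Optional[str] = None
    how: Optional[str] = None


def unanswered(meta: EntryMetadata) -> List[str]:
    """Return the questions the supplier left empty, in declaration order."""
    return [f.name for f in fields(meta) if not getattr(meta, f.name)]


def fit_for_lake(meta: EntryMetadata) -> bool:
    """An entry is accepted only when every question has an answer."""
    return not unanswered(meta)
```

An entry that fails the check can be sent back to the supplier with the list from `unanswered`, rather than patched up inside the Lake.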
Data must die
Have you ever wondered why financial websites operate publicly with a 15-minute delay? It’s because the data is already historical and can’t be used to establish trades; it’s already old. There is nothing wrong with using all the data available to process. The question is: will you gain anything from it?
Why should the Lake be any different? The source system should set a date using a metadata attribute that tells the Lake when the data can be discarded. It’s then up to the Lake to discard the data based on the demand for it.
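One way to picture this split of responsibilities is the sketch below. It assumes each entry carries a hypothetical `discard_after` date set by the source, and that the Lake keeps its own set of entry ids that are still in demand. Both names are made up for illustration.

```python
from datetime import date
from typing import Dict, Iterable, List, Set


def sweep(entries: Iterable[Dict], today: date, in_demand: Set[str]) -> List[Dict]:
    """Keep entries that have not expired, or that someone still asks for."""
    kept = []
    for entry in entries:
        expired = entry["discard_after"] < today  # date supplied by the source
        if expired and entry["id"] not in in_demand:
            continue  # past its date and nobody wants it: let it die
        kept.append(entry)
    return kept
```

The source only says when the data may go; whether it actually goes is the Lake’s call, which is why demand can override the date here.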
Rate your data
Data comes in a wide range of flavors, from raw to mastered. The challenge is to be able to measure it all on the same scale. Most users are more interested in the relevance of data than in its quantity.
Most rating algorithms represent only one dimension, i.e., the value as perceived by the source. One strategy could be to collect all these ratings from sources and users and average them, which still keeps it 1-dimensional. The other strategy could be to track the ratings from sources only. A user also becomes a supplier when he posts the report (data) back to the Lake with his rating. Once a relationship between the report and the source data is established, the rating becomes 2-dimensional.
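Here is one possible reading of the second strategy, as a sketch only. It assumes each report posted back to the Lake lists the source ids it was built from, in a hypothetical `derived_from` field, and carries its supplier’s own rating. The first dimension is the source’s rating of its data; the second is the mean rating of the reports derived from that data.

```python
from statistics import mean
from typing import Dict, List, Optional, Tuple


def two_dimensional_rating(
    source_id: str,
    source_ratings: Dict[str, float],
    reports: List[Dict],
) -> Tuple[float, Optional[float]]:
    """Return (rating given by the source, mean rating of derived reports)."""
    own = source_ratings[source_id]
    # Only reports with an established relationship to this source count.
    derived = [r["rating"] for r in reports if source_id in r["derived_from"]]
    return own, (mean(derived) if derived else None)
```

Keeping the two numbers separate, rather than averaging them, is what stops the rating from collapsing back into a single dimension.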
Now, you can imagine the amount of data and metadata that will flow into the Lake. It doesn’t compare with the volume of data flowing through Twitter, but for your organization, it’s sizable.
Get your badge
I hesitate, but this is important. The Lake is for everyone but not for “everyone.” What I mean is that, apart from aging and rating the data, you must consider securing it. Obviously, you have planned for a perimeter defense and various militarized zones to access the data, but what about the content in the data itself? Defense agencies do this all the time; they implement protection at the level of the individual data element.
Your business, however, may not be able to afford this, as it eats into performance. You can, however, implement peer-to-peer data visibility. Think of this as a Scout Badge: if the user has the supplier’s badge, he can view all of that supplier’s data; if not, none at all.
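The badge idea can be sketched as a coarse filter applied per supplier instead of per data element. The `supplier` field and the set of badges held by a user are hypothetical names chosen for this example.

```python
from typing import Dict, Iterable, List, Set


def visible_entries(entries: Iterable[Dict], user_badges: Set[str]) -> List[Dict]:
    """All-or-nothing visibility: a badge unlocks every entry from that supplier."""
    return [e for e in entries if e["supplier"] in user_badges]
```

Because the check is a single set lookup per entry, it costs far less than inspecting the content of each data element, at the price of much coarser control.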
In summary, don’t get hung up on terms that hinder growth and make you averse to change. Call it by whatever name you like, as long as you are getting what you want. Sometimes, data from a Swamp may be what your business needs more than “clean data coming out of a Faucet.”