Data lakes: A better way to analyze customer data

Early adopters share their experiences

When The Weather Company wanted to up its game in the forecasting world, executives knew the answer was to analyze even more data. However, the company's data warehouse was too constricting; it accepted only structured data and required as long as six months to develop appropriate schemas.

"Our goal was to inject data into our businesses as fast as possible to be able to see new opportunities," says Bryson Koehler, executive vice president, CTO and CIO of The Weather Company. "It's not realistic for a business to go dark on a project for any extended period of time just to clean up data. So much changes on a daily basis -- so many new sources of data -- that that journey would never be complete."

Koehler wanted to bring in data from anywhere it originated, including personal weather stations and Internet of Things sensors, to enrich analysis. With a traditional data warehouse, this would have been nearly impossible because of the unstructured nature of the new data, its volume, and the lengthy development time needed to process and validate it.

"We get data from a lot of startups, and I can't ask these companies to create a specialized format for us," Koehler says. "They would go somewhere else that would take it [as is], and that would take away a competitive advantage."

To head off that risk, The Weather Company became an early adopter of data lakes two years ago. This approach lets enterprises ingest, store and analyze unstructured, semi-structured and structured data without regard to its format, providing a more flexible repository than traditional data warehouses.
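The idea of accepting data as is and applying structure only when it is analyzed, often called schema-on-read, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `raw_lake` list stands in for an object store, and the station IDs, field names and payloads are hypothetical, not The Weather Company's actual data or code.

```python
import csv
import io
import json

# Hypothetical raw payloads, kept exactly as received (no upfront schema).
# In practice these would be objects in a store such as Amazon S3.
raw_lake = [
    ("json", '{"station": "PWS-1", "temp_f": 71.5}'),
    ("csv", "station,temp_f\nPWS-2,68.0\nPWS-3,70.25"),
    ("text", "lightning strike reported near PWS-1"),
]


def read_temperatures(lake):
    """Apply a schema at read time: yield (station, temp_f) pairs
    from the formats this particular analysis understands."""
    for fmt, payload in lake:
        if fmt == "json":
            record = json.loads(payload)
            yield record["station"], float(record["temp_f"])
        elif fmt == "csv":
            for row in csv.DictReader(io.StringIO(payload)):
                yield row["station"], float(row["temp_f"])
        # Anything else stays in the lake untouched for other analyses.


temperatures = list(read_temperatures(raw_lake))
print(temperatures)
```

The free-text record is never rejected or reshaped at ingest; a different analysis could read it later with its own rules, which is the flexibility the approach trades against upfront validation.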

Many of today's data lakes work with Apache's Hadoop open-source distributed framework to store and process data. EMC, HP Enterprise, IBM, Microsoft and Informatica are among the companies offering data lake platforms that operate with Hadoop. (IBM recently purchased some digital assets from The Weather Company.)

The Weather Company uses Amazon S3 for its data lake, and the Apache Cassandra database and Apache Spark to process real-time analytics, Koehler says. With the data lake strategy, The Weather Company can accept data from 135,000 independent personal weather stations run by hobbyists around the world. That information swirls in the lake with other critical data about events including lightning strikes and turbulence to provide weather insights for data scientists and business professionals alike.

Since starting the data lake project, The Weather Company has been able to improve temperature forecasts by two degrees of accuracy. "Two degrees is a lot in this world," Koehler says.

Where lakes work -- and don't

Oliver Halter, a partner in PwC's analytics practice, says the speed with which data is changing and new data sources are becoming available will lead more companies to consider data lakes.

In the time it takes a company to perfectly integrate 15 data sources into a data warehouse, "another 50 sources become available that also are of value," Halter says.

Judith Hurwitz, president and CEO of consultancy Hurwitz & Associates, says data warehouses and data lakes serve different purposes. "When you want to know everything about your competition and sales of blue shirts and what everyone on the Web is saying, then you can grab everything there is to know about blue shirts" and amass it in the data lake, she says.

"It's not a cost question" when it comes to warehouse or lake, she explains. "If you're really tying the data to a business decision -- where the data had better be clean and fully reliable -- you probably wouldn't use a data lake."

That's because data lakes do have the potential to fail if implemented for the wrong reason in the wrong way. "If you need to report your financials or do a [government] filing, then data has to be as clean as possible," and a data warehouse "is the appropriate solution," Halter says.

"When you are doing analytics on relatively raw, un-normalized data, the chance of misinterpretation or simply non-perfect data matching may not be accurate enough for precise, 'dollar accurate' financial statements or transactions," he explains. Data science in data lakes is more about trend analysis and directional correctness than about precise results.

Halter suggests another way to think of lakes vs. warehouses. "If you need a set of factual numbers that need to tick and tie and hold up to scrutiny -- 'Our outstanding accounts receivables as of March 31st are 3,567,444,556' -- then you need the data warehouse approach," he says. "If you need to give directional guidance -- 'We think the market for XYZ will grow by 60% to 80%' or 'Customer XYZ has a 35% higher likelihood to buy product A vs. B' -- then the data lake approach can work."

In most cases, an organization would have both data warehouses and data lakes. "Once you discover something in the data lake of value to the organization that you want to repeat, then it can be shifted to the data warehouse to be normalized and harmonized," Halter says.
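What "normalized and harmonized" might mean in practice can be sketched in a few lines of Python. This is an illustrative assumption rather than a method described by Halter: the records, field names and unit conversion are hypothetical, and an in-memory SQLite table stands in for a warehouse with a fixed schema.

```python
import sqlite3

# Hypothetical findings from lake exploration, with inconsistent IDs and units.
lake_records = [
    {"station": "pws-1", "temp": 71.6, "unit": "F"},
    {"station": "PWS-2 ", "temp": 20.0, "unit": "C"},
    {"station": "PWS-3", "temp": None, "unit": "F"},  # incomplete reading
]


def normalize(record):
    """Return (station_id, temp_c) in one harmonized form, or None if unusable."""
    if record["temp"] is None:
        return None
    if record["unit"] == "C":
        temp_c = record["temp"]
    else:
        temp_c = (record["temp"] - 32) * 5 / 9  # Fahrenheit to Celsius
    return record["station"].strip().upper(), round(temp_c, 1)


# The warehouse stand-in enforces a fixed schema; only clean rows are loaded.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE readings (station_id TEXT NOT NULL, temp_c REAL NOT NULL)"
)
rows = [row for row in map(normalize, lake_records) if row is not None]
warehouse.executemany("INSERT INTO readings VALUES (?, ?)", rows)

loaded = warehouse.execute(
    "SELECT station_id, temp_c FROM readings ORDER BY station_id"
).fetchall()
print(loaded)
```

The incomplete reading is kept out of the warehouse table but remains available in the lake, mirroring the split between exploratory and "tick and tie" uses described above.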
