Data Lake, huh, What is it good for, Absolutely everything!

Forgive me! That was me trying to be funny by borrowing a lyric from Edwin Starr’s “War” for the title. <link to YouTube>

I digress! This past week I attended the Data Science North Carolina Conference 2019 in Raleigh to get caught up on the latest concepts in data science. One of the things I discussed with a group of data scientists was data swamp vs. data lake vs. data warehouse, so I figured I would write a short post about it.

Organizations are starting to leverage new technologies to meet their ever-evolving business needs. They are generating more information at a faster rate than ever before. This comes with the need to record and store all that information, as well as to develop meaningful, actionable analytics from it.

In short, a data lake is a centralized repository where an organization stores anything and everything, from structured data (e.g., relational data generated by its operational databases and transactional systems) to unstructured data (e.g., non-relational data from IoT devices, social media, etc.). The data is stored as it comes in, without any format manipulation, and little to no curation is performed.
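To make "stored as it comes in" concrete, here is a minimal sketch in Python. It assumes a local temporary folder standing in for the lake's storage, and the `land_raw` helper, the source names, and the sample payloads are all made up for illustration. The point is only that nothing gets parsed or reshaped on the way in; structure is applied later, when someone reads the data.

```python
import json
import tempfile
from pathlib import Path


def land_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store a payload exactly as received under <lake_root>/raw/<source>/.

    Hypothetical helper: a real lake would usually sit on object storage,
    not a local folder.
    """
    target = lake_root / "raw" / source / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # no parsing, no reformatting, no curation
    return target


# A temporary directory stands in for the lake (assumption for this sketch).
lake = Path(tempfile.mkdtemp())

# Structured data (a CSV export) and unstructured data (a device message)
# land side by side in their original formats.
csv_path = land_raw(lake, "orders_db", "orders.csv", b"id,amount\n1,9.99\n")
json_path = land_raw(lake, "iot", "sensor.json", b'{"device": "t-17", "temp_c": 21.5}')

# Structure is only applied when the data is read back for analysis.
reading = json.loads(json_path.read_bytes())
print(reading["temp_c"])
```

The trade-off is that every consumer has to know how to interpret what it reads, which is why the cataloging and governance discussed further down matter so much.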

An organization can benefit significantly by developing analytics like machine learning on these data lakes to identify and rectify issues, or to act on new opportunities to grow its business.

The definition of a data warehouse can vary depending on who you ask. According to Bill Inmon, a data warehouse is a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process. According to Ralph Kimball, a data warehouse is a copy of transaction data specifically structured for query and analysis. A data warehouse is developed with careful curation and usually has a specific function. It gives an organization the ability to answer questions it already knows to ask but does not yet have answers to, and it has one purpose: to make structured data readily available to its users.
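As a rough illustration of "structured for query and analysis", the sketch below uses Python's built-in `sqlite3` module with an in-memory database as a stand-in for a warehouse. The `sales` table, its columns, and the rows are invented for this example. The structure is fixed before any data is loaded, so rows that do not fit are rejected, and a known question ("what are total sales per region?") is a single query.

```python
import sqlite3

# An in-memory SQLite database stands in for the warehouse (assumption).
conn = sqlite3.connect(":memory:")

# The structure is decided up front, before any data arrives.
conn.execute(
    """
    CREATE TABLE sales (
        region    TEXT NOT NULL,
        sale_date TEXT NOT NULL,
        amount    REAL NOT NULL
    )
    """
)

# Curated rows that match the schema load cleanly.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "2019-01-05", 100.0),
        ("east", "2019-01-06", 50.0),
        ("west", "2019-01-05", 75.0),
    ],
)

# A row that breaks the schema is rejected instead of being stored as-is.
try:
    conn.execute("INSERT INTO sales VALUES (?, ?, ?)", ("west", "2019-01-06", None))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# The known question is answered with one query.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
print(totals, rejected)
```

Compare this with the lake, where the same rejected row would simply have been stored and left for whoever reads it later to deal with.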

While a data lake does have an advantage over a traditional data warehouse in its ability to store massive amounts of data without the need for structure, proper care has to be taken so that the required governance, metadata cataloging, and access regulations are in place (see data swamp below).

Data mining on the data lake to derive new types of analytics, like machine learning, is typically restricted to qualified professionals (e.g., data scientists or data engineers). As a result, the downstream applications developed from the data lake (e.g., BI reports, dashboards, etc.) can be trusted by the wider, general audience.

Now onto Data Lake vs. Data Swamp

A data swamp is what can become of your organization’s data lake over time with poor data governance, no predefined mechanisms to catalog data, storage of extraneous information, and a lack of meaningful cleanup processes.
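One of those symptoms, files nobody has cataloged, is easy to check for. The following is a minimal sketch, assuming a lake laid out as folders on disk and a catalog kept as a simple list; `CatalogEntry`, `uncataloged_files`, and the file names are hypothetical and not taken from any particular catalog tool.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record: what a file is and who owns it."""
    path: str         # path relative to the lake root
    owner: str
    description: str


def uncataloged_files(lake_root: Path, catalog: list[CatalogEntry]) -> list[str]:
    """Return lake files that have no catalog entry (candidates for swamp)."""
    known = {entry.path for entry in catalog}
    found = (
        p.relative_to(lake_root).as_posix()
        for p in lake_root.rglob("*")
        if p.is_file()
    )
    return sorted(f for f in found if f not in known)


# Build a tiny lake in a temporary directory (assumption for this sketch).
lake = Path(tempfile.mkdtemp())
(lake / "raw" / "iot").mkdir(parents=True)
(lake / "raw" / "iot" / "sensor.json").write_text("{}")
(lake / "raw" / "iot" / "dump_final_v2.bin").write_bytes(b"\x00")

catalog = [CatalogEntry("raw/iot/sensor.json", "iot-team", "Thermostat readings")]

# Anything listed here has no owner and no description on record.
orphans = uncataloged_files(lake, catalog)
print(orphans)
```

Running a check like this on a schedule, and either cataloging or cleaning up whatever it reports, is one simple way to keep the cleanup process from being forgotten.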

A data swamp does not provide any value, and any effort to mine it usually ends in frustration, with the derived information being unusable. Data lakes, as noted above, make it easy for data scientists to mine and analyze data. They require minimal transformation, if any, to facilitate automated pattern identification, and they make an excellent online archive.

How to Choose?

Each organization’s requirements are different and unique. It’s no longer as simple as deciding to develop either a data warehouse or a data lake. Instead, an organization will most likely need both, as they serve different purposes.
