“Big Data”, “Data Lake”, “Data Swamp”, “Hybrid Architecture”, “NoSQL”, “Hadoop” … terms you are confronted with very often these days when you are dealing with data. Furthermore, the question comes up if you really need a data warehouse nowadays when you deal with a high variety and volume of data. We want to talk about what a data lake is, if we need a data warehouse when using NoSQL platforms like Hadoop, and how it is combined with Data Vault.
WHAT IS A DATA LAKE?
There is a proper definition from Tamara Dull (SAS): “A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.” 1
The last sentence, that the data structure and requirements are not defined until the data is needed, says that the structure is mapped when the data is queried from the data repository (data lake), known as “schema on read”. The difference to traditional RDBMS (schema on write) is, that you don’t pre-define the structure of the data (files) itself when loading the data into the NoSQL database. This does not say, that structure is not necessary – on the contrary, the structure is very important and required to utilize a data lake. This is not defined directly in the data lake (NoSQL database), but when the location of the files is attached to a schema, executed when the data is read. If no structure is defined and you are using a NoSQL database like Hadoop – which is just a file storage – as a “landing zone”, it will become a data dump. Then you will come into the situation, that your data scientists or power users have to do a high effort when going to the dump and try to find things of value in it.
For example, with tools like Hive you can declare your query schema, and this by using MPP (Massive Parallel Processing) on a HDFS.
DO WE STILL NEED A DATA WAREHOUSE?
When you have a data lake with all your data, you might ask yourself the question if you need a data warehouse … or do you already have a data warehouse when having a data lake?
First, we have to compare the terms data lake and data warehouse. A data warehouse stores its data subject-oriented, time variant and integrated by business key. A data lake is not subject oriented when ingested, not integrated at all and cannot handle CDC (Change Data Capture) or deltas, because you can not update the content of files (you just add or replace an existing file). From the technology and capability perspective, there are also differences/changes by using a traditional RDBMS (for data warehouses) and New SQL / NoSQL platforms (for data lakes).
|Traditional RDBMS||vs.||New SQL / NoSQL|
|structured, processed||DATA||structured / semi-structured / unstructured, raw|
|expensive for large data volumes||STORAGE||designed for low-cost storage|
|less agile, fixed configuration||AGILITY||highly agile, configure and reconfigure as needed|
|business professionals||USERS||data scientists et. al.|
Figure 1: Traditional RDBMS vs. New SQL / NoSQL 2
This short comparison makes clear, that a data lake is not a data warehouse. But, NoSQL and RDBMS can live happily together and should be designed to co-exist. When using a data lake only, some lacks are coming up:
|Governance||NoSQL (today) allows a free-for-all. If you can log in to Hadoop, you can see everything that exists there.|
|Definition||Everyone call it a data lake, but data lakes are actually stratified (designed, managed, governed and defined).|
|Business Value||Think on ontologies, taxonomies, tags, terms, and business keys. More data is just that - more data. The question should be: Where is the business value hidden?|
|Structural attachment||We can't understand what we don't identify” (think on architecture and design). Turning raw data / data dumps in to stratified data lakes. Physical data modeling is less important, business modeling, ontology modeling, logical modeling becomes far more important. If you can't understand what you have, you won't be able to correlate the data properly.|
|Good architecture||Think about augmenting your existing infrastructure. Don't throw your existing EDW away. Leverage what you've built, and augment it by enriching the environment with a co-existent NoSQL / NewSQL component.|
|Lack of commitment to training||This is probably the largest faux-pas that a company can make when embarking on the NoSQL / NewSQL BI journey. It's not just training in the platform that counts, but training in Agility, and Data Vault 2.0 best practices that make the difference.|
Figure 2: Lacks when using a data lake only
Back to the question, if we need a data warehouse. The short answer is, it depends: If you don’t have an existing data warehouse and all you do is data science work or investigating your data, you probably don’t want or need a data warehouse. But, if you need structure, analysis, multi-integrated systems (integrated by consistent business keys), tie your data back to the business processes by business key, then you definitely need a data warehouse. In summary: when you want to extract value out of the data lake (the data in the data lake), a data warehouse is a mature concept.
When a data warehouse already exists, a best practice is to augment your existing data warehouse (RDBMS) by integrate it in the data lake or probably use a hybrid architecture by replacing the relational staging area by a HDFS based staging area which captures all unstructured and structured data.
2 adopted from https://www.smartdatacollective.com/big-data-cheat-sheet-what-executives-want-know/
How to Get Updates and Support
Please send inquiries and feature requests to [email protected].
To support the creation of Visual Data Vault drawings in Microsoft Visio, a stencil is implemented that can be used to draw Data Vault models. The stencil is available at www.visualdatavault.com.