“Big Data”, “Data Lake”, “Data Swamp”, “Hybrid Architecture”, “NoSQL”, “Hadoop” … these are terms you are confronted with very often these days when you are dealing with data. Furthermore, the question comes up whether you really need a data warehouse nowadays when you deal with a high variety and volume of data. We want to talk about what a data lake is, whether we need a data warehouse when using NoSQL platforms like Hadoop, and how a data lake is combined with Data Vault.
WHAT IS A DATA LAKE?
Last month we talked about a hybrid architecture in Data Vault 2.0, where we explained how to combine structured and unstructured data. To follow up on this topic, we now want to explain how your business users (especially power users) can benefit from it with the managed Self-Service Business Intelligence (mSSBI) approach in Data Vault 2.0.
ABOUT SELF-SERVICE BI
Self-service BI allows end-users to completely circumvent IT, typically in response to the unresponsiveness of IT. In this approach, business users are left on their own with the whole process, from sourcing the data from operational systems to integrating and consolidating the raw data. There are many problems with this self-service approach without the involvement of IT:
Business users expect their data warehouse systems to load and prepare more and more data, in terms of variety, volume, and velocity. Also, the workload put on typical data warehouse environments keeps increasing, especially once the initial version of the warehouse has become a success with its first users. Therefore, scalability has multiple dimensions. Last month we talked about Satellites, which play an important role in scalability. Now we explain how to combine structured and unstructured data with a hybrid architecture.
LOGICAL DATA VAULT 2.0 ARCHITECTURE
The Data Vault 2.0 architecture is based on three layers: the staging area which collects the raw data from the source systems, the enterprise data warehouse layer, modeled as a Data Vault 2.0 model, and the information delivery layer with information marts as star schemas and other structures. The architecture supports both batch loading of source systems and real-time loading from the enterprise service bus (ESB) or any other service-oriented architecture (SOA).
Data Vault 2.0 is a concept for data warehousing, invented by Dan Linstedt. It brings many new features that help anyone who is concerned with Business Intelligence enter a new age of data warehousing. Data Vault 2.0 is a Big Data concept that integrates relational data warehousing with unstructured data warehousing in real-time. It is an extensible data model where new data sources are easy to add. When our founders wrote the book, they required a visual approach to model the concepts of Data Vault in the book. For this purpose, they developed the graphical modeling language, which focuses on the logical aspects of Data Vault. The Microsoft Visio stencils and a detailed white paper are available on www.visualdatavault.com as a free download.
SATELLITES IN VISUAL DATA VAULT
Satellites add descriptive data to hubs and links. Descriptive data is stored in attributes that are added to the satellite. The individual attributes are added to the satellite one at a time. A satellite might be attached to any hub or link. However, it is only possible to attach the satellite to one parent.
With the advent of Data Vault 2.0, which adds architecture and process definitions to the Data Vault 1.0 standard, Dan Linstedt standardized the Data Vault symbols used in modeling. Based on these standardized symbols, the Visual Data Vault (VDV) modeling language was developed, which can be used by EDW architects to build Data Vault models. When our founders wrote the book, they required a visual approach to model the concepts of Data Vault in the book. For this purpose, they developed the graphical modeling language, which focuses on the logical aspects of Data Vault. The Microsoft Visio stencils and a detailed white paper are available on www.visualdatavault.com as a free download.
LINKS IN VISUAL DATA VAULT
In June this year we published another newsletter on how hubs are modeled in the accounting industry. In this newsletter we explain the function of standard links and how the modeling works in the banking industry.
Links connect individual hubs in a Data Vault model and represent either transactions or relationships between business objects. No business object is entirely separate from other business objects. Instead, they are connected to each other through the operational business processes that use business objects in the execution of their tasks. The image below shows a link that connects two hubs (a standard link has to have at least two connections).
One common requirement in data warehouse projects is to provide data lineage from end-to-end. However, custom solutions (for example custom Meta Marts for self-developed Data Vault generators) or tools from different vendors often break such end-to-end data lineage.
Unlike business or technical metadata, which is provided by the business or source applications, process execution metadata is generated by the data warehouse team and provides insights into the ETL processing for maintenance. The data is used by the data warehouse team or by end-users to better understand the data warehouse performance and the results presented in the information marts. One type of process execution metadata is the control flow metadata; a control flow executes one or more data flows among other tasks. Logging the process execution provides a valuable tool for maintaining or debugging the ETL processes of the data warehouse because it provides information about the data lineage of all elements of the data warehouse.
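To make the idea of logging process execution more concrete, here is a minimal Python sketch. It is not taken from any Data Vault tool: the `ExecutionRecord` fields, the `logged_data_flow` helper and the table names are all hypothetical, chosen only to show how one run of a data flow could record its source, target, timing, row count and outcome.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutionRecord:
    """One row of process execution metadata (hypothetical layout)."""
    process_name: str   # name of the data flow that ran
    source: str         # where the data came from (lineage)
    target: str         # where the data was written (lineage)
    started_at: float
    finished_at: Optional[float] = None
    rows_written: int = 0
    status: str = "running"


# In a real warehouse this would be a metadata table; a list is enough here.
EXECUTION_LOG = []


@contextmanager
def logged_data_flow(process_name, source, target):
    """Record start, end, and outcome of a single data flow execution."""
    record = ExecutionRecord(process_name, source, target, started_at=time.time())
    EXECUTION_LOG.append(record)
    try:
        yield record
        record.status = "succeeded"
    except Exception:
        record.status = "failed"
        raise
    finally:
        record.finished_at = time.time()


# Usage: wrap a (pretend) load and report how many rows it wrote.
with logged_data_flow("load_hub_customer", "stage.customer", "hub_customer") as run:
    run.rows_written = 3
```

Each entry in `EXECUTION_LOG` then answers the typical maintenance questions: which process wrote to which target, from which source, when, and whether it succeeded.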
With the advent of Data Vault 2.0, which adds architecture and process definitions to the Data Vault 1.0 standard, Dan Linstedt standardized the Data Vault symbols used in modeling. Based on these standardized symbols, the Visual Data Vault (VDV) modeling language was developed, which can be used by EDW architects to build Data Vault models. The authors of the book “Building a Scalable Data Warehouse”, who are the founders of Scalefree, required a visual approach to model the concepts of Data Vault in the book. For this purpose, they developed the graphical modeling language, which focuses on the logical aspects of Data Vault. The Microsoft Visio stencils and a detailed white paper are available on www.visualdatavault.com as a free download.
Hubs in Visual Data Vault
Business keys play an important role in every business, because they are referenced by business transactions and relationships between business objects. Whenever a business identifies and tracks business objects, business keys are used throughout business processes. This is one of the reasons why Data Vault is based on the business keys. In Data Vault models, business keys are stored in hub entities. The challenge is to identify the business keys which represent a business object uniquely. That can be just one business key, but also a composite key or a smart key. The first image shows a hub with only one business key attribute:
One of the most obvious changes in Data Vault 2.0 is the introduction of hash keys in the model. These hash keys are mandatory because of the many advantages they bring. Hash keys not only speed up the loading process; they also ensure that the enterprise data warehouse can span multiple environments: on-premise databases, Hadoop clusters and cloud storage.
Let’s discuss the performance gain first: to speed up the loading procedures, dependencies in the loading process have to be minimized or even eliminated. Back in Data Vault 1.0, sequence numbers were used to identify a business entity, which introduced dependencies into the loading process. These dependencies slowed down the load process, which is especially an issue with real-time feeds: hubs had to be loaded first before the load process of the satellites and links could start. The intention is to break these dependencies by using hash keys instead of sequence numbers as the primary key.
Business Keys vs Hash Keys
First of all, a business key may be a sequence number created by a single source system, e.g. the customer number. But a business key can also be a composite key that uniquely identifies a business entity, e.g. a flight in the aviation industry is identified by the flight number and the date because the flight number is reused every day.
In general: a business key is the natural key used by the business to identify a business object.
While using the business keys in Data Vault might be an option, it is actually a slow one that uses a lot of storage (even more than hash keys). Especially in links and their dependent satellites, many composite business keys are required to identify the relationship or transaction/event in a link, and to describe it in the satellite. This would require a lot of storage and slow down the loading process because not all database engines can execute efficient joins on variable-length business keys. In addition, we would have too many columns in the link, because every business key must be part of the link, and these columns would have different data types with different lengths. This issue is exacerbated because the business keys also have to be replicated into their satellites. To guarantee a consistent join performance, the solution is to combine the business keys into a single column value by using hash functions to calculate a unique representation of a business object.
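The idea of collapsing a composite business key into one fixed-length value can be sketched in a few lines of Python. This is only an illustration under assumptions that the text above does not specify: MD5 as the hash function, `;` as the delimiter, and trimming plus upper-casing as the normalization. The function name `hash_key` is hypothetical.

```python
import hashlib

DELIMITER = ";"  # assumed separator between the parts of a composite key


def hash_key(*business_keys):
    """Combine one or more business keys into a single fixed-length hash.

    Each part is trimmed and upper-cased (an assumed normalization rule)
    so that the same business object yields the same hash regardless of
    how a source system formats its keys.
    """
    normalized = [str(key).strip().upper() for key in business_keys]
    joined = DELIMITER.join(normalized)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# A flight is identified by flight number and date (composite business key).
flight_hk = hash_key("LH400", "2024-01-15")

# A customer is identified by a single business key.
customer_hk = hash_key("C-1001")
```

Whatever the number, type and length of the business keys, the result is always a 32-character string, which is what keeps join columns uniform across hubs, links and satellites. The delimiter matters: without it, the key pairs `("AB", "C")` and `("A", "BC")` would be joined into the same input and collide.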
Massively Parallel Processing (MPP)
Because hubs, links, and satellites can be loaded independently of each other, all of them can be loaded in parallel.
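A small Python sketch can illustrate why this works. Every loader derives the hash keys it needs directly from the business keys in the staged rows, so none of them has to wait for another one to finish. Everything here is an assumption for illustration: the staged rows, the loader functions, the MD5-based `hash_key` helper, and the thread pool, which merely stands in for parallel database loads.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def hash_key(*business_keys):
    """Hypothetical helper: hash the normalized, delimited business keys."""
    joined = ";".join(str(key).strip().upper() for key in business_keys)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


# Hypothetical staged rows, as they might arrive from a source system.
STAGED_ROWS = [
    {"customer_no": "C-1", "store_no": "S-9", "customer_name": "Ada"},
    {"customer_no": "C-2", "store_no": "S-9", "customer_name": "Grace"},
]


def load_hub_customer(rows):
    # Hub: distinct business keys, identified by their hash key.
    return {hash_key(r["customer_no"]): r["customer_no"] for r in rows}


def load_link_customer_store(rows):
    # Link: references to both hubs, computed without looking the hubs up.
    return {
        hash_key(r["customer_no"], r["store_no"]): (
            hash_key(r["customer_no"]),
            hash_key(r["store_no"]),
        )
        for r in rows
    }


def load_sat_customer(rows):
    # Satellite: descriptive data, keyed by the parent hub's hash key.
    return {
        hash_key(r["customer_no"]): {"customer_name": r["customer_name"]}
        for r in rows
    }


loaders = (load_hub_customer, load_link_customer_store, load_sat_customer)
with ThreadPoolExecutor() as pool:
    # All three loads are submitted at once; none depends on another.
    futures = [pool.submit(loader, STAGED_ROWS) for loader in loaders]
    hub, link, sat = (future.result() for future in futures)
```

With sequence numbers, the link and satellite loaders would first have to look up the generated number in the hub, which forces the hub to be loaded first.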
When Dan Linstedt, co-founder of Scalefree, invented the Data Vault, he had several goals in mind. One of the goals was to load data as fast as possible from the source into a data warehouse model, process it into information and present it to the business analyst in any desired target structure.
For simplicity and automation, the Data Vault model consists of only three basic entity types:
- Hubs: a distinct list of business keys
- Links: a distinct list of relationships between business keys
- Satellites: descriptive data that describes the parent (business key or relationship) from a specific context, versioned over time.
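As a rough illustration, the three entity types can be written down as plain Python data classes. This is a sketch, not a physical model: the field names are hypothetical, the hash key values are placeholders, and the `load_date` and `record_source` fields are assumed here as typical bookkeeping columns rather than taken from the text above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Tuple


@dataclass
class Hub:
    """One entry in the distinct list of business keys."""
    hash_key: str
    business_key: str
    load_date: datetime
    record_source: str


@dataclass
class Link:
    """One entry in the distinct list of relationships between business keys."""
    hash_key: str
    hub_hash_keys: Tuple[str, ...]  # references to the connected hubs
    load_date: datetime
    record_source: str


@dataclass
class Satellite:
    """Descriptive data for one parent (hub or link) at one point in time."""
    parent_hash_key: str
    load_date: datetime             # versions the description over time
    record_source: str
    attributes: Dict[str, str]


# Usage: one customer whose description changed between two loads.
customer = Hub("hk-customer-42", "C-42", datetime(2024, 1, 1), "crm")
versions = [
    Satellite(customer.hash_key, datetime(2024, 1, 1), "crm", {"city": "Hannover"}),
    Satellite(customer.hash_key, datetime(2024, 3, 1), "crm", {"city": "Berlin"}),
]
latest = max(versions, key=lambda sat: sat.load_date)
```

The hub never changes once the business key is known; history accumulates only in the satellite, where each load adds a new version for the same parent.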
Now, as we always teach (and sometimes preach): you can model all enterprise data using these three entity types alone. However, a model using only these entity types would have multiple disadvantages: many complex joins, high storage consumption, weaker ingestion performance, and missed opportunities for virtualization.
The solution? Adding a few more nuts and bolts to the core entity types of the Data Vault in order to cope with these issues. One of these nuts and bolts is the non-historized link, also known as Transaction Link:
In this example, Sales is a non-historized link that captures sales transactions of a customer, related to a store. The goal of the non-historized link is to ensure high performance on the way into the data warehouse and on the way out. Don’t forget, the ultimate goal of data warehousing is to build a data warehouse, not just to model it. And building a data warehouse involves much more than just the model: it requires people, processes, and technology.