Hash Keys in the Data Vault

By | Scalefree Newsletter | One Comment

One of the most obvious changes in Data Vault 2.0 is the introduction of hash keys in the model. These hash keys are mandatory because of the many advantages. Hash keys do not only speed up the loading process; they also ensure that the enterprise data warehouse can span across multiple environments: on-premise databases, Hadoop clusters and cloud storage.

Let’s discuss the performance gain first: to increase the loading procedures, dependencies in the loading process have to be minimized or even eliminated. Back in Data Vault 1.0 sequence numbers were used to identify a business entity and that had to include dependencies during the loading process as a consequence. These dependencies have slowed down the load process what is especially an issue in real-time-feeds. Hubs had to be loaded first before the load process of the satellites and links could start. The intention is to break these dependency by using the hash keys instead of sequence numbers as the primary key.

Business Keys vs Hash Keys

In advance, business keys may be a sequence number created by a single source system, e.g. the customer number. But, business keys can also be a composite key to uniquely identify a business entity, e.g. a flight in the aviation industry is identified by the flight number and the date because the flight number will be reused every day.

In general: a business key is the natural key used by the business to identify a business object.

While using the business keys in Data Vault might be an option, it is actually a slow one, using a lot of storage (even more than hash keys). Especially in links and their dependent satellites, many composite business keys are required to identify the relationship or transaction / event in a link – and to describe it in the satellite. This would require a lot of storage and slow down the loading process because not all database engines have the capability to execute efficient joins on variable length business keys. On the other hand we would have too many columns in the link, because every business key must be a part of the link. The issue at this point is that we also have different data types with different lengths in the links. This issue is exaggerated because it is also required to replicate the business keys into their satellites. To guarantee a consistent join performance, the solution is to combine the business keys into a single column value by using hash functions to calculate a unique representation of a business object.

Massively Parallel Processing (MPP)

Due to the independence during the load process of hubs, links and satellites, it is possible to do that all in parallel.

The idea is to use the fact that a hash key is derived from a business key or combination of business keys without the need of a lookup in the a parent table. Therefore, instead of looking up the sequence of a business key in a hub before describing the business key in the satellite, we can just calculate the hash key of the business key. The (correct) implementation of the hash function ensures that the same semantic business key leads to exactly the same hash key, regardless of the target entity loaded. Read More

The Value of Non-Historized Links

By | Scalefree Newsletter | No Comments

When Dan Linstedt, co-founder of Scalefree, invented the Data Vault, he had several goals in mind. One of the goals was to load data as fast as possible from the source into a data warehouse model, process it into information and present it to the business analyst in any desired target structure.

For simplicity and automation, the Data Vault model exists only of three basic entity types:

  1. Hubs: a distinct list of business keys
  2. Links: a distinct list of relationships between business keys
  3. Satellites: descriptive data, that describe the parent (business key or relationship) from a specific context, versioned over time.

Now, as we always teach (and sometimes preach): you can model all enterprise data using these three entity types alone. However, a model using only these entity types would have multiple disadvantages. Many complex joins, storage consumption, ingestion performance and missed opportunities for virtualization.

The solution? Adding a little more nuts and bolts to the core entity types of the Data Vault in order to cope with these issues. One of the nuts and bolts is the non-historized link, also known as Transaction Link:

In this example, Sales is a non-historized link that captures sales transactions of a customer, related to a store. The goal of the non-historized link is to ensure high performance on the way into the data warehouse and on the way out. Don’t forget, the ultimate goal of data warehousing is to build a data warehouse not just model it. And building a data warehouse involves much more than just the model: it requires people, processes, and technology. Read More