Hash Keys in the Data Vault

April 28, 2017 | Architecture

One of the most obvious changes in Data Vault 2.0 is the introduction of hash keys in the model. These hash keys are mandatory because of the many advantages they provide. Hash keys not only speed up the loading process; they also ensure that the enterprise data warehouse can span multiple environments: on-premises databases, Hadoop clusters, and cloud storage.

Let’s discuss the performance gain first: to speed up the loading procedures, dependencies in the loading process have to be minimized or even eliminated. In Data Vault 1.0, sequence numbers were used to identify a business entity, which introduced dependencies into the loading process. These dependencies slow down the load, which is especially an issue with real-time feeds: hubs had to be loaded before the load of the satellites and links could start. The intention is to break this dependency by using hash keys instead of sequence numbers as the primary key.
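
To make the difference concrete, here is a minimal Python sketch (with made-up table structures and illustrative names) that contrasts the sequence-number lookup of Data Vault 1.0 with the direct hash calculation of Data Vault 2.0. The trim/uppercase normalisation is a common convention, not a fixed standard:

import hashlib

# Data Vault 1.0 style: the satellite load depends on the hub being loaded
# first, because the surrogate key has to be looked up there.
hub_customer = {"C-1001": 1, "C-1002": 2}  # business key -> sequence number

def satellite_key_via_lookup(business_key):
    # Fails (or has to wait) if the hub row has not been loaded yet.
    return hub_customer[business_key]

# Data Vault 2.0 style: the key is derived from the business key itself,
# so no lookup is needed and no load dependency arises.
def satellite_key_via_hash(business_key):
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

print(satellite_key_via_lookup("C-1001"))  # requires the hub to exist first
print(satellite_key_via_hash("C-1001"))    # can run before or alongside the hub load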

Business Keys vs Hash Keys

In the simplest case, a business key may be a number created by a single source system, e.g. the customer number. But a business key can also be a composite key that uniquely identifies a business entity, e.g. a flight in the aviation industry is identified by the flight number and the date, because the flight number is reused every day.

In general: a business key is the natural key used by the business to identify a business object.

While using the business keys in the Data Vault might be an option, it is actually a slow one that uses a lot of storage (even more than hash keys). Especially in links and their dependent satellites, many composite business keys are required to identify the relationship, transaction or event in a link – and to describe it in the satellite. This requires a lot of storage and slows down the loading process, because not all database engines can execute efficient joins on variable-length business keys. In addition, the link ends up with too many columns, because every business key must be part of the link, and those columns come with different data types and different lengths. The issue is exacerbated because the business keys also have to be replicated into their satellites. To guarantee consistent join performance, the solution is to combine the business keys into a single column value by using a hash function to calculate a unique representation of a business object.
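
As an illustration, the following sketch combines a composite business key (the flight example from above) into a single fixed-length hash key. The “||” delimiter and the trim/uppercase normalisation are common conventions rather than a fixed standard; whatever rules you choose must be applied identically in every loading process:

import hashlib

def hash_key(*business_key_parts, delimiter="||"):
    # Normalise each part, join with a delimiter, and hash the result
    # into one fixed-length value (32 hex characters for MD5).
    normalized = delimiter.join(p.strip().upper() for p in business_key_parts)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# A flight is only unique per flight number *and* date:
print(hash_key("LH1234", "2017-04-28"))  # one CHAR(32) column instead of several key columns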

Massively Parallel Processing (MPP)

Because hubs, links and satellites are loaded independently of each other, it is possible to load them all in parallel.

The idea is to use the fact that a hash key is derived from a business key, or a combination of business keys, without the need for a lookup in a parent table. Therefore, instead of looking up the sequence number of a business key in a hub before describing that business key in a satellite, we can simply calculate the hash key of the business key. A correct implementation of the hash function ensures that the same semantic business key always leads to exactly the same hash key, regardless of the target entity being loaded.
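
A sketch of that idea, using hypothetical loader functions and Python’s standard thread pool purely for illustration: hub and satellite derive the same hash key independently, so neither has to wait for the other.

import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_key(business_key):
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# Hypothetical loaders: each derives the hash key on its own instead of
# looking up a sequence number in a parent table.
def load_hub(rows):
    return [(md5_key(r["customer_no"]), r["customer_no"]) for r in rows]

def load_satellite(rows):
    return [(md5_key(r["customer_no"]), r["name"]) for r in rows]

source_rows = [{"customer_no": "C-1001", "name": "Alice"},
               {"customer_no": "C-1002", "name": "Bob"}]

with ThreadPoolExecutor() as pool:
    hub_rows = pool.submit(load_hub, source_rows)
    sat_rows = pool.submit(load_satellite, source_rows)

# The same business key produces the same hash key in both loads.
assert hub_rows.result()[0][0] == sat_rows.result()[0][0]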

Hash Keys to Join Apache™ Hadoop Data Sets

Without hashing, a load into Hadoop® or NoSQL requires a lookup of the hub or link sequence in the relational system before it can insert or attach its data. Hashing instead of sequencing means that we can load all hubs, all links and all satellites in completely parallel operations, and enrich all Hadoop- or NoSQL-based documents in parallel at the same time. It also allows joins across multiple heterogeneous platforms – from Teradata Database to Apache™ Hadoop®, for example.

Hash Difference

The hash difference column applies to the satellites. The approach is the same as with the business keys, except that here all the descriptive data is hashed. That reduces the effort during the load process, because only one column has to be compared. The satellite load process first checks whether the hash key already exists in the satellite, and then whether the hash difference values differ.
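
A minimal sketch of that comparison, assuming the satellite keeps the latest hash difference per hash key in a simple dictionary; the normalisation and delimiter are, again, conventions you would fix project-wide:

import hashlib

def hash_diff(*descriptive_values, delimiter="||"):
    # Hash all descriptive attributes of a satellite row into one column.
    normalized = delimiter.join(v.strip().upper() for v in descriptive_values)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def needs_insert(latest_hash_diff_by_key, hash_key, new_hash_diff):
    # Insert if the hash key is new, or if the descriptive data changed.
    latest = latest_hash_diff_by_key.get(hash_key)
    return latest is None or latest != new_hash_diff

satellite = {}  # hash key -> latest hash diff
hk, hd = "abc123", hash_diff("Alice", "Main Street 1")
print(needs_insert(satellite, hk, hd))   # True: new hash key
satellite[hk] = hd
print(needs_insert(satellite, hk, hd))   # False: unchanged descriptive data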

The flow diagram above shows the process when the source system sends a full data delivery. The values should be hashed on the way into the staging area, because at this point there is time to check for “duplicate collisions” and to handle hashing issues before continuing with the data in the Data Vault.
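
One way to perform such a check in staging, sketched below, is to flag every hash key that was produced by more than one distinct business key; the row format is an assumption made for the example:

from collections import defaultdict

def find_duplicate_collisions(staged_rows):
    # staged_rows: iterable of (hash_key, business_key) pairs from the staging area.
    business_keys_per_hash = defaultdict(set)
    for hash_key, business_key in staged_rows:
        business_keys_per_hash[hash_key].add(business_key)
    # Any hash key shared by more than one distinct business key is a collision.
    return {hk: bks for hk, bks in business_keys_per_hash.items() if len(bks) > 1}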

What Hash Function to Use

There are many hash functions to choose from: MD5, MD6, SHA-1, SHA-2, and more. In most cases we recommend the MD5 algorithm, with a length of 128 bits, because it is available across almost all platforms and offers a decently low probability of collisions at an acceptable storage requirement.

An additional advantage of hashing is that all hash values have the same length and the same data type, which yields a further performance improvement. Furthermore, hash values are generated, so they can never get lost and can always be recreated.
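
To make the storage side of the trade-off concrete, the digest sizes of a few common algorithms can be checked directly (16 bytes for MD5, 20 for SHA-1, 32 for SHA-256):

import hashlib

for name in ("md5", "sha1", "sha256"):
    print(name, hashlib.new(name).digest_size, "bytes")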

Collision

Hashing carries one very small risk: the collision. A collision means that two different inputs produce the same hash value. However, the risk is very small; for example, in a database with more than one trillion hash values, the probability of a collision is about the same as the odds of a meteor landing on your data center.
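
To put a number on “very small”, the usual birthday-problem approximation p ≈ n^2 / (2m) can be applied, assuming an ideal 128-bit hash function such as MD5:

n = 10**12          # one trillion hash keys
m = 2**128          # possible MD5 values
p = n**2 / (2 * m)
print(f"{p:.2e}")   # about 1.5e-15 for a single hub of this size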

Summary

Hash keys are not optional in the Data Vault, they are a must. The advantages of

  • massively parallel processing (MPP),
  • data load performance,
  • consistency and
  • auditability

are indispensable and cannot be achieved with sequences or business keys.

Hash keys are no silver bullet… but the advantages they provide far outweigh the disadvantage (storage) they introduce.

One Comment

  • “A collision means that two different inputs produce the same hash value. However, the risk is very small; for example, in a database with more than one trillion hash values, the probability of a collision is about the same as the odds of a meteor landing on your data center.”

    I’m glad to see an advocate of DV 2.0 finally taking the data volume into account when discussing the probability of a hash-collision. This is certainly an improvement in comparison to the first few writings by Linstedt on the topic.

    It is a bit disappointing though that the way the statistic is stated seems to serve only to convince the future DV 2.0 practitioner that the chance is negligible and can be ignored in practice.

    Assuming an ideal hash function, the chance of a collision in one key set (i.e. one hub) can be expressed as a function of both the data volume and the number of possible key values. Why not be more transparent and explain how you arrived at the number? It would be much more useful if you’d give people a means to calculate the chance of a collision (given a particular hash function and maximum data volume), or the maximum data volume (given a particular hash function and the risk they are willing to take to run into a collision).

    You’re of course right to state that even with very large data volumes, the commonly available hash functions would yield a very small chance of a collision. For the specific example of 1 trillion values and a key length of 128 bits, the chance is approximately 1*10^-15. (Many useful functions to make such approximations can be found here: http://preshing.com/20110504/hash-collision-probabilities/) But if we were to have multiple such hubs, the chance increases by that factor for each hub. If 1000 DV practitioners each build 1000 DVs, each with 1000 hubs, then the chance of at least one running into a collision is already 1 in a million.

    At any rate the chance will always be very small and I am not disputing that at all. I’m just saying that as the volume per hub grows, and the number of hubs grows, the chances are ever increasing.

    However, all of this detracts from the fact that no matter how small the chances are, they can still happen. No matter how close your hash function comes to an ideal hash function, and no matter how many values it could theoretically cover, the next row you load could in principle always result in a collision. You can simply be out of luck, and you cannot predict when it will happen. The best you can do is predict that it won’t happen often.

    But when it does happen, it is of little comfort that you ended up having such extraordinarily bad luck: no matter how many trillions of $1 transactions you might have successfully loaded, you might encounter your hash collision just as you were about to load the one transaction worth 1 trillion $. It will be no good that the chance was small if, as in this example, all the successful loads combined were worth just as much as that one failure due to a hash collision.

    So, in short, it’s not good enough to convince us that the chance of a collision is very small. What DV 2.0 needs to do is to explain exactly what to do when it does happen. Do you take the loss and explain to your customer that you built a solution that simply loses data sometimes because it can’t load it due to a collision? Or do you stop the load process and take the DV offline in order to choose another hash function and/or hash key size? And how would you justify that to your customer?

    Whatever the answer is, it cannot be another hash key, since the only thing that does is further decrease the chance of a collision. But regardless of how big or small the chance is, we still need a solution for when it does happen.

    Thus far I have not seen any DV 2.0 advocate explain in clear and plain English what the recommended strategy is to deal with collisions when loading the hub.

    I applaud you for explaining the load of satellites with the flow diagram. If I understand the loading scheme proposed by your flow diagram correctly, then you simply drop the row in case the hash of the key of the satellite already exists. So if the lookup succeeds, but it was a colliding key, then you simply don’t load the satellite row (i.e. you lose data). Is that correct? If so, then it is rather concerning that not even an attempt is made to detect a collision. This means you’d be losing data silently and query results will be wrong.

    The text doesn’t explain in detail what the “hash difference value” is exactly, but it seems clear that the same assumption is held that if the hashes appear to line up, the data must have been the same. So again – in case of a hash collision, the data is silently discarded, without even an attempt to detect the collision.

    I have heard people argue: “you may be right, but I have loaded many DV 2.0 vaults, and never encountered a collision”. Then my answer would be that if this loading scheme is typical for the practice, then – regardless of the small chances of a collision, of which I am well aware – it is not surprising that you never encountered it, since you have literally no mechanism in place to detect a collision. So even if you’d run into a collision, you wouldn’t even know.

    DV is an interesting technology and DV 2.0 seems promising for many purposes. But as long as the concern of collisions is not addressed, and no concrete methodology is offered to detect and mitigate collisions, you cannot trust DV 2.0 to be a reliable system of record. There are many clever people working in the DV space – it would be cool if they’d put their energy into proposing a reliable and performant solution for this problem, rather than keep trying to convince DV practitioners not to worry about collisions at all on the grounds that the chance of running into them is so small.

    Best regards,

    Roland Bouman
