The Value of Non-Historized Links

By April 3, 2017Modeling

When Dan Linstedt, co-founder of Scalefree, invented the Data Vault, he had several goals in mind. One of the goals was to load data as fast as possible from the source into a data warehouse model, process it into information and present it to the business analyst in any desired target structure.

For simplicity and automation, the Data Vault model exists only of three basic entity types:

  1. Hubs: a distinct list of business keys
  2. Links: a distinct list of relationships between business keys
  3. Satellites: descriptive data, that describe the parent (business key or relationship) from a specific context, versioned over time.

Now, as we always teach (and sometimes preach): you can model all enterprise data using these three entity types alone. However, a model using only these entity types would have multiple disadvantages. Many complex joins, storage consumption, ingestion performance and missed opportunities for virtualization.

The solution? Adding a little more nuts and bolts to the core entity types of the Data Vault in order to cope with these issues. One of the nuts and bolts is the non-historized link, also known as Transaction Link:

In this example, Sales is a non-historized link that captures sales transactions of a customer, related to a store. The goal of the non-historized link is to ensure high performance on the way into the data warehouse and on the way out. Don’t forget, the ultimate goal of data warehousing is to build a data warehouse not just model it. And building a data warehouse involves much more than just the model: it requires people, processes, and technology.

So, how does the non-historized link meets its goals? Think about your business analysts. What are their goals? In the end, to be honest, they don’t care about a Data Vault model. Instead they would like to see dimensional models, such as star schema’s and snow flake models or flat-and-wide models for data mining. Or, once in a while, they want to see the ugly-looking tables from the mainframe, sometimes linked, sometimes not, and not a lot of people fully understand the relationships anymore…but for backwards compability, that’s just great.

Having defined the target, the next question comes into mind: what is the target granularity? For example, in a dimensional model, the target granularity of fact tables often reflects the transactions to be analyzed (think about call records in the telecommunications industry, or banking transactions).

Interestingly, this desired target granularity can often be found directly in the source systems. Because a telecommunications provider has an operational system in place that records each phone call. Or a banking application that records every account transaction. And these records are typically loaded to the data warehouse without aggregation (at least in Data Vault where we are interested in the finest granularity for auditing and delivery purposes).

And here comes the problem with the standard Data Vault entity types. While they are very simple and patternized, they have one problem. It’s the fact that the standard link “stores a distinct list of relationships”, as stated above. That means, the link is only interested in relationships from the source that are unknown to the target link. If a customer walks into a store multiple times, and purchases the same product, the relationship between customer (number), store (number) and product (number) is already known and no additional link entry is added.

As a result, the granularity of the incoming data is changed when loading the target link. If the transaction’s underlying relationship is already known, the transaction would be omitted (and instead captured by a satellite).

The next problem is that the link granularity now differs from the target granularity because the business analyst wanted one record per transaction and not per distinct business key relationship. Another grain shift is required that typically involves joining the satellite of the link to the link itself to recover the original grain.

As we explained in our book “Building a Scalable Data Warehouse with Data Vault 2.0” a grain shift is relatively costly in terms of performance. This is because the operation requires costly GROUP BY statements or LEFT and RIGHT JOINS.

And for what? The end result of both operations often results in the original granularity from the source system. Two expensive grain shifts for nothing sounds like a bad deal.

And it is.

That is where the non-historized link comes into play: the link is a simple variation of the standard link with the goal to capture the source transactions and events at the original granularity.

In a standard link, the granularity is defined by the number of hub references in the link. This would not be sufficient to capture multiple transactions that involve the same business keys (e.g., the same customer, store, and product) in the same target link. To resolve the issue, an additional element of the alternate key on the non-historized link is required. Typically a transaction or event ID, such as the call ID or a basket ID in the retail industry (which also can be modeled as own Hub). In combination with the hub references, the combined business key should be unique per transaction. This way, we can now capture one record in the target link per source record from the operational system.

Figure: Non-Historized Link with Sales ID as an additional key

One deeper scenario would be, if the same product (in that case the highest granularity) emerges twice in a sale, because of different discounts for example. In that case, the line item number would be an additional key (Dependent Child Key) to make every data set unique.

From a loading perspective, the ingestion process of the incoming data is much improved, because no grain shift is required anymore. All data is loaded directly from the source into the target. And as long as the business analyst wants to report exactly on this granularity, it is actually possible to restructure the non-historized link table into a fact table by using a virtual SQL view on top.
One major point is that a non-historized link do not allow edits to the descriptive data, meaning there is never any history issued on the facts (e.g. sensor or machine generated data). Because the descriptive data cannot change, there is no need to put it in a Satellite, rather descriptive data is kept directly in the non-historized link.

For performance reasons, avoid grain shifts on the way out. If a different target granularity is required, and it is not possible to load this granularity from the source system or from the Data Vault model, consider a bridge table. The purpose of this entity type is to materialize the grain shift, while keeping the advantage of customizing the target according to the individual requirements set for the dimensional target.

Figure: Virtual Bridge Table as Fact Table with GROUP BY Store Hash Key

Join the discussion 7 Comments

  • What should I do if there is no useful additional key like the sales_id in the example.
    The source only provides the load timestamp?

    • Hello Oliver,
      with the assumption that the transaction itself is not related to a business process (like a sale), but comes in real-time from a machine, sensor or similar, you can use a sequence in the NH-Link.
      Best regards,
      Your Scalefree Team

  • Siva J says:

    As there are no satellites for NH-Link and descriptive data is part of it, what if any attributes need to be added or removed as per source data changes ? Will it not impact/compromise the flexibility ?

    • Hi Siva,

      in order to add descriptive data to a non-historized link, you have two options:

      1. Have the descriptive attributes denormalized to the link entity
      2. Have a “non-historized satellite” (PK = HK of parent, no LDTS) attached to the link with a 1:1 relationship

      If your descriptive data changes, turn it into a standard satellite. If the attributes in the link change over time, use counter-transactions.

      Hope that clarifies it a bit. We discuss these topics in training in much more detail (hope to see you there 😉 )


      a non-historized link supports the use of “non-historized satellites” bes

  • Richard Hudson says:

    If a source system holds counter-transactions with a reference back to the original transaction being reversed and we load both original and counter transactions into a non-historized link; what would be the preferred approach to preserving that reference between the original and counter transactions received from the source system?
    The only options I can think of are:
    1) Is this one of the rare cases for using a Link against another Link, with, say, a dedicated ‘Reversal’ Link joining together the two non-historized link records for original and counter transactions – should not slow down the loading of the actual transactions into the non-historized link if it is just loaded where it occurs in an independent process, but does introduce additional change dependencies on the non-historized link, albeit pretty focused as the additional link only joins to the same non-historized link twice.
    2) Document the id of the original transaction in a non-historized satellite against the non-historized link record for the counter transaction – feels like a bit of a cop-out as there is a genuine relationship between the two non-historized transactions, it is not really just descriptive information and would make any attempt to navigate between the original and counter transactions by consumption processes very inefficient.

    • Lara Brandenburg says:

      Hi Richard,

      the recommendation here is to use one Non-Historized Link for both transactions, and the audit trail is kept in a Same-As-Link. When your source system delivers counter transactions with relation to the original one, this is the best-case scenario you can have to make the loading as easy as possible while having an audit trail in place. The assumption here is that each transaction has a consistent singular ID (e.g. transaction ID) in addition to the original reference. Use this attribute as the “dependent child key” (part of the Link Hash Key but no own Hub) and load the counter transaction into the same Non-Historized Link as they come from the source. The assumption here is that the values are multiplied with (-1) by source system, or any other recognition that this is a counter transaction. Additionally, load both, the Link Hash Key from the original record and the Link Hash Key from the counter record into a Same-as-Link to keep the audit trail with the effective date.

      Kind regards,
      Your Scalefree Team

Leave a Reply