An Enterprise Document Warehouse Architecture

By | Scalefree Newsletter | No Comments
A common requirement for enterprise data warehousing is to provide an analytical model for information delivery, for example in a dashboard or reporting solution. One challenge in this scenario is that the required target model, often a dimensional star or snowflake schema or just a denormalized flat-and-wide entity, doesn’t match the source data structure. Instead the end-user of the analytical data will directly or indirectly define the target structure according to the information requirements.

Another challenge is the data itself, regardless of its structure.
In many, if not most, cases, the source data doesn’t meet the information requirements of the user regarding its content. In many cases, the data needs cleansing and transformation before it can be presented to the user.

Instead of just loading the data into a MongoDB collection and wrangling it until it fits the needs of the end user, the Data Vault 2.0 architecture proposes an approach that allows data as well as business rules, which are used for data cleansing in addition to transformation, to be re-used by many users. To achieve this, it is made up of a multi-layered architecture that contains the following layers:

Read More

Processing Enterprise Data with Documents in MongoDB

By | Scalefree Newsletter | No Comments
Today’s enterprise organizations receive and process data from a variety of sources, including silos generated by web as well as mobile applications, social media, artificial intelligence solutions in addition to IoT sensors. That said, the efficient processing of this data at high volume in an enterprise setting is still a challenge for many organizations. 

Typical challenges include issues such as the integration of mainframe data with real-time IoT messages and hierarchical documents.
One of such issues being that enterprise data is not clean and might have contradicting characteristics as well as interpretations. This poses a challenge for many processes such as when integrating customers from multiple source systems.

Though, data cleansing could be considered as a solution to this problem. However, what if different data cleansing rules should be applied to the incoming data set? For example, because the basic assumption for “a single version of the truth” doesn’t exist in most enterprises. While one department might have a clear understanding of how the incoming data should be cleansed, another department, or an external party, might have another understanding. 

Read More

DATA VAULT 2.0’s INVENTOR OFFERS UNPRECEDENTED ON-SITE ACCESS

By | Scalefree Newsletter | No Comments

To all those that have been a part of the Scalefree journey up until this point,

We’d first and foremost like to thank you for all the contributions you have made in helping us build Scalefree into the firm it is today. All of your contributions and business have allowed us to create a success story beyond what was first imagined and for that we offer our gratitude.

That said, a recent development here at Scalefree has presented the company with the opportunity to offer unprecedented, on-site access to the man that helped make all of this possible, the inventor of Data Vault 2.0, Dan Linstedt.

Though before diving into the unique opportunity that presents you, a little about how we got here.

Read More

Data Vault Use Cases Beyond Classical Reporting: Part 1

By | Scalefree Newsletter | No Comments

To put it simply, an Enterprise Data Warehouse (EDW) collects data from your company’s internal as well as external data sources, to be used for simple reporting and dashboarding purposes. Often, some analytical transformations are applied to that data as to create the reports and dashboards in a way that is both more useful and valuable. That said, there exist additional valuable use cases which are often missed by organizations when building a data warehouse. The truth being, EDWs can access untapped potential beyond simply reporting statistics of the past. To enable these opportunities, Data Vault brings a high grade of flexibility and scalability to make this possible in an agile manner.

Data Vault Use Cases

To begin, the data warehouse is often used to collect data as well as preprocess the information for reporting and dashboarding purposes only. When only utilizing this single aspect of an EDW, users are missing opportunities to take advantage of their data by limiting the EDW to such basic use cases.

A whole variety of use cases can be realized by using the data warehouse, e.g. to optimize and automate operational processes, predict the future, push data back to operational systems as a new input or to trigger events outside the data warehouse, to simply explore but a few new opportunities available.

Read More

What to consider for naming conventions in Data Warehousing – Part 1

By | Scalefree Newsletter | 3 Comments

An initial decision of critical importance within Data Vault development relates to the definition of naming conventions for database objects. As part of the development standardization, these conventions are mandatory as to maintain a well-structured and consistent Data Vault model. It is important to note that proper naming conventions boost usability of the data warehouse, not only for solution developers but also for power users within data exploration.

Throughout this article, we will present the most vital considerations within our standard book, the process of defining naming conventions.

Naming convention documentation

It is one aspect to simply define naming conventions utilized within the development of your data warehouse, but it is completely another to establish consistency as to create defined naming conventions that are to become standards. That said, it is a good practice to document a guideline for naming Data Warehouse objects. To that end, the next sections will discuss several considerations to take account of when defining the naming conventions for a data warehouse solution.  Read More

Bridge Tables 101: Why they are useful

By | Scalefree Newsletter | No Comments

Within Data Vault there are special entities which leverage the query performance on the way out of the Data Vault. These entities are placed between the Data Vault and the Information Delivery Layer and are necessary for instances in which many joins and aggregations on the Raw Data Vault are executed what cause performance issues. This often happens when designing the virtualized fact tables in the information and data marts. Thus, to produce the required granularity in the fact tables without increasing the query time, Bridge tables come into play. Bridge tables belong to the Business Vault and have the purpose of improving performance, similar in manner to the PIT table which was discussed in a prior newsletter.

As a means to achieve its goals, the bridge table materializes the grain shift that is often required within the information delivery process. Though, before we dig deeper into the specifics of using a bridge table for performance tuning, it is important to first define granularities within a data warehouse.

Grain Definitions in Data Warehousing

The grain within a dimensional model is the level of detail available of each table. Thus, the grain of a fact table is defined by the number of related dimensions. Basically, there are three different types of granularities for fact entities within a dimensional model. Read More

Alternative to the Driving Key Implementation in Data Vault 2.0

By | Scalefree Newsletter | 19 Comments

Back in 2017 we introduced the link structure with an example of a Data Vault model in the banking industry. We showed how the model looks like when a link represents either a relationship or a transaction between two business objects. A link can also connect more than two hubs. Furthermore, there is a special case when a part of the hub references stored in a link can change without describing a different relation. This has a great impact on the link satellites. What is the alternative to the Driving Key implementation in Data Vault 2.0?

The Driving Key

A relation or transaction is often identified by a combination of business keys in one source system. In Data Vault 2.0 this is modelled as a normal link connecting multiple hubs each containing a business key. A link contains also its own hash key, which is calculated over the combination of all parents business keys. So when the link connects four hubs and one business key changes, the new record will show a new link hash key. There is a problem when four business keys describe the relation, but only three of them identify it unique. We can not identify the business object by using only the hash key of the link. The problem is not a modeling error, but we have to identify the correct record in the related satellite when query the data. In Data Vault 2.0 this is called a driving key. It is a consistent key in the relationship and often the primary keys in the source system. Read More

How to implement insert only in Data Vault 2.0?

By | Scalefree Newsletter | No Comments

Skilled modeling is important to harness the full potential of Data Vault 2.0. To get the most out of the system due to scalability and performance, it also has to be built on an architecture which is completely insert only. On the way into the Data Vault, all update operations can be eliminated and loading processes simplified.

THE COMMON IMPLEMENTATION

In the common loading patterns, there are two important technical timestamps in Data Vault 2.0. The first is the load date timestamp (LDTS). This timestamp does not represent a business date that comes from the source system. Instead, it provides the information about when the data was first loaded into the data warehouse, usually the staging area. Read More

How to use Point in Time Tables (PIT) in the Insurance Industry?

By | Scalefree Newsletter | 4 Comments

A problem that occurs when querying the data out of the Raw Data Vault happens when there are multiple satellites on a hub or a link:

Figure 1: Data Vault model including PIT (logical)
In the above example, there are multiple satellites on the hub Customer and link included in the diagram. This is a very common situation for data warehouse solutions because they integrate data from multiple source systems. However, this situation increases the complexity when querying the data out of the Raw Data Vault. The problem arises because the changes to the business objects stored in the source systems don’t happen at the same time. Instead, a business object, such as a customer (an assured person), is updated in one of the many source systems at a given time, then updated in another system at another time, etc. Note that the PIT table is already attached to the hub, as indicated by the ribbon. Read More

Managed Self-Service BI – Success in spite of stringent regulation

By | Scalefree Newsletter | No Comments
The latest story to find itself added to the growing number of successful implementations of Scalefree’s services, and Data Vault 2.0 as a whole, centers around a sector known for its strict regulatory bodies in addition to high volume of data that demands the utmost in terms of privacy and security.

As the events that unfolded during the various financial crises at the start of the century left governments the world over seeking to impose stricter regulations for the financial sector. Banks within the sector were faced with a new task as they sought to continue operating with expansion in mind while still falling well within defined standards. Read More