Capturing Semi-Structured Descriptive Data

By | Scalefree Newsletter | No Comments
The previous articles in this series presented hub and link entities, which capture business keys and the relationships between them. For example, the customer hub, stored as a document collection in MongoDB, is a distinct list of the business keys used to identify customers.
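
As a minimal sketch of the hub idea (not the series' actual load code), the following Python models a hub collection as a dictionary keyed by a hash of the business key. The field names (`customer_number`, `load_date`, `record_source`), the MD5 hash, and the sample keys are assumptions made for illustration; in MongoDB each value would be one document in the hub collection.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(business_key: str) -> str:
    """Hash a normalized business key (MD5 is an assumption of this sketch)."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def load_hub(hub: dict, business_keys, record_source: str) -> dict:
    """Add unseen business keys to the hub; known keys are left untouched."""
    for bk in business_keys:
        hk = hash_key(bk)
        if hk not in hub:
            hub[hk] = {
                "_id": hk,
                "customer_number": bk.strip().upper(),
                "load_date": datetime.now(timezone.utc),
                "record_source": record_source,  # first source that delivered the key
            }
    return hub

# Hypothetical customer numbers arriving from two source systems.
hub_customer = {}
load_hub(hub_customer, ["C-1001", "C-1002", "c-1001 "], "home_insurance")
load_hub(hub_customer, ["C-1002", "C-1003"], "car_insurance")
print(sorted(doc["customer_number"] for doc in hub_customer.values()))
# ['C-1001', 'C-1002', 'C-1003']
```

Each business key appears once, no matter how many sources deliver it or how it is cased and padded.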

To capture descriptive data, that is, the data that describes the business keys, Data Vault uses satellite entities. Because both business keys and the relationships between them can be described by user data, satellites may be attached to hubs as well as links.
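
A satellite attached to a hub could be sketched as follows. This is an illustrative assumption rather than the article's own implementation: each satellite entry carries the parent hash key, a load date, and a hash over the descriptive attributes (often called a hash diff), and a new version is appended only when the descriptive data has changed. All names and sample values are hypothetical.

```python
import hashlib
import json

def hash_diff(payload: dict) -> str:
    """Hash the descriptive attributes in a stable key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()

def load_satellite(satellite: list, parent_hash_key: str, payload: dict, load_date: str) -> bool:
    """Append a new version only if the descriptive data changed."""
    new_diff = hash_diff(payload)
    # The list is append-only, so the last matching entry is the latest version.
    latest = next(
        (d for d in reversed(satellite) if d["parent_hash_key"] == parent_hash_key),
        None,
    )
    if latest is not None and latest["hash_diff"] == new_diff:
        return False
    satellite.append({
        "parent_hash_key": parent_hash_key,
        "load_date": load_date,
        "hash_diff": new_diff,
        "payload": dict(payload),
    })
    return True

sat_customer = []
results = [
    load_satellite(sat_customer, "hk-customer-1", {"name": "Ann Example", "city": "Hanover"}, "2020-01-01"),
    load_satellite(sat_customer, "hk-customer-1", {"name": "Ann Example", "city": "Hanover"}, "2020-01-02"),
    load_satellite(sat_customer, "hk-customer-1", {"name": "Ann Example", "city": "Berlin"}, "2020-01-03"),
]
print(results)  # [True, False, True]
```

The same function works for a satellite on a link; only the parent hash key would then refer to a link entry instead of a hub entry.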

Identifying Additional Relationships between Documents

The last article in our series covered the Data Vault hub entity, which captures a distinct list of business keys in an enterprise data warehouse; most integration occurs on these hubs. However, there are scenarios in which integrating data solely on hubs is not sufficient for the end goal.

Consider the sample data set of an insurance company whose customers sign car and home insurance policies and file claims against them. Before moving forward with the example, it is important to note that there are relationships between the business keys involved: the customer number, the policy identifiers, and the claims.

These relationships are captured by Data Vault link entities. Just like hubs, links contain a distinct list of records, so they store no duplicates. Together, hubs and links form the skeleton of the Data Vault, which is later described by the descriptive user data stored in satellites.
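
To make the "distinct list of relationships" concrete, here is a small sketch under stated assumptions: the link is keyed by a hash over the combined, normalized business keys, so the same customer-policy pair is stored only once. The delimiter, the MD5 hash, the field names, and the sample keys are all hypothetical choices for this illustration.

```python
import hashlib

def link_hash_key(*business_keys: str) -> str:
    """Hash the combination of normalized business keys (delimiter is an assumption)."""
    normalized = ";".join(key.strip().upper() for key in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def load_link(link: dict, customer_number: str, policy_id: str, record_source: str) -> bool:
    """Insert the relationship if it is new; return False for a known relationship."""
    lhk = link_hash_key(customer_number, policy_id)
    if lhk in link:
        return False
    link[lhk] = {
        "_id": lhk,
        "customer_number": customer_number.strip().upper(),
        "policy_id": policy_id.strip().upper(),
        "record_source": record_source,
    }
    return True

link_customer_policy = {}
load_link(link_customer_policy, "C-1001", "H-1", "home_insurance")
load_link(link_customer_policy, "C-1001", "A-7", "car_insurance")
load_link(link_customer_policy, "c-1001", "a-7", "car_insurance")  # duplicate after normalization
print(len(link_customer_policy))  # 2
```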

Integrating Documents from Heterogeneous Sources

Within this part of our ongoing blog series, we would like to introduce a sample data set based upon insurance data. This data set will be used to explain the concepts and patterns expanded upon further in the post. That said, please consider the following situation: an insurance company utilizes two different operational systems, let’s say, a home insurance policy system and a car insurance policy system.

Both systems should be technically integrated, which means if a new customer signs up for a home insurance policy, the customer’s data should be synchronized into the car insurance policy system as well and kept in sync at all times. Thus, when the customer relocates, the new address is updated within both systems.

In reality, things often don’t go as one would expect. First of all, the two systems are usually not well integrated, or not integrated at all. Adding to the complexity, in some worst-case scenarios data is manually copied from one system to the other, and updates are applied to only some of the data sets rather than all of them, leading to inconsistent, contradictory source data. The same situation often applies to data sets after mergers and acquisitions.

Document Processing in MongoDB

Continuing our ongoing series, this article describes the basics of querying and modifying data in MongoDB, with a focus on what is needed for the Data Vault load and query patterns.

In contrast to the tables used by relational databases, MongoDB uses a JSON-based document data model. Documents are a more natural way to represent data: a single structure with related data embedded as sub-documents and arrays collapses what would otherwise be separated into parent-child tables linked by foreign keys in a relational database. You can model data in any way that your application demands, from rich, hierarchical documents through to flat, table-like structures, simple key-value pairs, text, geospatial data, and the nodes and edges used in graph processing.
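
The contrast can be sketched in plain Python, with a dictionary standing in for a JSON document. The customer, policy, and claim fields below are made up for illustration; the function shows how one nested document would spread across three parent-child row sets linked by foreign keys in a relational design.

```python
# One hypothetical customer document with embedded sub-documents and arrays.
customer_doc = {
    "customer_number": "C-1001",
    "name": "Ann Example",
    "address": {"city": "Hanover", "zip": "30159"},
    "policies": [
        {"policy_id": "H-1", "type": "home", "claims": []},
        {"policy_id": "A-7", "type": "car",
         "claims": [{"claim_id": "CL-9", "amount": 1200}]},
    ],
}

def to_relational(doc: dict):
    """Split one nested document into parent-child rows linked by foreign keys."""
    customers = [{
        "customer_number": doc["customer_number"],
        "name": doc["name"],
        "city": doc["address"]["city"],
        "zip": doc["address"]["zip"],
    }]
    policies, claims = [], []
    for policy in doc["policies"]:
        policies.append({
            "policy_id": policy["policy_id"],
            "customer_number": doc["customer_number"],  # foreign key to customers
            "type": policy["type"],
        })
        for claim in policy["claims"]:
            claims.append({
                "claim_id": claim["claim_id"],
                "policy_id": policy["policy_id"],  # foreign key to policies
                "amount": claim["amount"],
            })
    return customers, policies, claims

customers, policies, claims = to_relational(customer_doc)
print(len(customers), len(policies), len(claims))  # 1 2 1
```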

An Enterprise Document Warehouse Architecture

A common requirement for enterprise data warehousing is to provide an analytical model for information delivery, for example in a dashboard or reporting solution. One challenge in this scenario is that the required target model, often a dimensional star or snowflake schema or just a denormalized flat-and-wide entity, doesn’t match the source data structure. Instead the end-user of the analytical data will directly or indirectly define the target structure according to the information requirements.

Another challenge is the data itself, regardless of its structure. In many, if not most, cases, the content of the source data doesn’t meet the user’s information requirements, and the data needs cleansing and transformation before it can be presented to the user.

Instead of just loading the data into a MongoDB collection and wrangling it until it fits the needs of the end user, the Data Vault 2.0 architecture proposes an approach that allows data as well as business rules, which are used for data cleansing and transformation, to be reused by many users. To achieve this, it uses a multi-layered architecture that contains the following layers:

Processing Enterprise Data with Documents in MongoDB

Today’s enterprise organizations receive and process data from a variety of sources, including silos generated by web and mobile applications, social media, artificial intelligence solutions, and IoT sensors. Even so, processing this data efficiently at high volume in an enterprise setting is still a challenge for many organizations.

Typical challenges include the integration of mainframe data with real-time IoT messages and hierarchical documents. One such issue is that enterprise data is not clean and might have contradicting characteristics as well as interpretations. This poses a challenge for many processes, for example when integrating customers from multiple source systems.

Data cleansing could be considered a solution to this problem. However, what if different data cleansing rules should be applied to the incoming data set, for example because “a single version of the truth” doesn’t exist in most enterprises? While one department might have a clear understanding of how the incoming data should be cleansed, another department, or an external party, might have another understanding.
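
One way to picture this is to keep the raw data untouched and apply each department's cleansing rule on the way out. The sketch below is an illustration only: the two rules, the department names, and the sample records are hypothetical, not taken from the article.

```python
# Raw records are stored as delivered; they are never modified in place.
raw_customers = [
    {"customer_number": "C-1001", "country": "germany "},
    {"customer_number": "C-1002", "country": "DE"},
    {"customer_number": "C-1003", "country": ""},
]

def rule_marketing(record: dict) -> dict:
    """Hypothetical rule: upper-case the country and label missing values."""
    country = record["country"].strip().upper() or "UNKNOWN"
    return {**record, "country": country}

def rule_finance(record: dict) -> dict:
    """Hypothetical rule: map known country names to ISO codes, else None."""
    iso_codes = {"GERMANY": "DE", "DE": "DE"}
    return {**record, "country": iso_codes.get(record["country"].strip().upper())}

def apply_rule(records, rule):
    """Return cleansed copies; the raw records remain available to other rules."""
    return [rule(record) for record in records]

marketing_view = apply_rule(raw_customers, rule_marketing)
finance_view = apply_rule(raw_customers, rule_finance)
print([r["country"] for r in marketing_view])  # ['GERMANY', 'DE', 'UNKNOWN']
print([r["country"] for r in finance_view])    # ['DE', 'DE', None]
```

Because both views are derived from the same unchanged raw records, neither department's interpretation overwrites the other's.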

DATA VAULT 2.0’s INVENTOR OFFERS UNPRECEDENTED ON-SITE ACCESS

To all those that have been a part of the Scalefree journey up until this point,

We’d first and foremost like to thank you for all the contributions you have made in helping us build Scalefree into the firm it is today. All of your contributions and business have allowed us to create a success story beyond what was first imagined and for that we offer our gratitude.

That said, a recent development here at Scalefree has presented the company with the opportunity to offer unprecedented, on-site access to the man who helped make all of this possible, the inventor of Data Vault 2.0, Dan Linstedt.

Before diving into the unique opportunity this presents, a little about how we got here.

Data Vault Use Cases Beyond Classical Reporting: Part 1

To put it simply, an Enterprise Data Warehouse (EDW) collects data from your company’s internal and external data sources to be used for simple reporting and dashboarding purposes. Often, some analytical transformations are applied to that data to make the reports and dashboards more useful and valuable. That said, there are additional valuable use cases that organizations often miss when building a data warehouse. In truth, EDWs hold untapped potential beyond simply reporting statistics of the past. To enable these opportunities, Data Vault brings a high degree of flexibility and scalability, making them possible in an agile manner.

Data Vault Use Cases

To begin, the data warehouse is often used to collect data as well as preprocess the information for reporting and dashboarding purposes only. When only utilizing this single aspect of an EDW, users are missing opportunities to take advantage of their data by limiting the EDW to such basic use cases.

A whole variety of use cases can be realized with the data warehouse: optimizing and automating operational processes, predicting the future, pushing data back to operational systems as a new input, or triggering events outside the data warehouse, to name but a few of the opportunities available.

What to consider for naming conventions in Data Warehousing

An initial decision of critical importance within Data Vault development relates to the definition of naming conventions for database objects. As part of development standardization, these conventions are mandatory to maintain a well-structured and consistent Data Vault model. Proper naming conventions boost the usability of the data warehouse, not only for solution developers but also for power users exploring the data.

Throughout this article, we present the most important considerations from our standards book for the process of defining naming conventions.

Naming convention documentation

It is one thing to define the naming conventions used in the development of your data warehouse, but quite another to apply them consistently so that they become standards. It is therefore good practice to document a guideline for naming data warehouse objects. To that end, the next sections discuss several considerations to take into account when defining the naming conventions for a data warehouse solution.

Bridge Tables 101: Why they are useful

Within Data Vault there are special entities that improve query performance on the way out of the Data Vault. These entities are placed between the Data Vault and the Information Delivery Layer and are needed when many joins and aggregations are executed on the Raw Data Vault, which causes performance issues. This often happens when designing virtualized fact tables in the information and data marts. To produce the required granularity in the fact tables without increasing the query time, bridge tables come into play. Bridge tables belong to the Business Vault and have the purpose of improving performance, similar to the PIT table discussed in a prior newsletter.

To achieve this, the bridge table materializes the grain shift that is often required within the information delivery process. Before we dig deeper into the specifics of using a bridge table for performance tuning, it is important to first define granularities within a data warehouse.
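
As a rough illustration of a materialized grain shift, the sketch below pre-joins two hypothetical links (customer to policy, policy to claim) and stores one row per customer instead of one row per claim. It is deliberately simplified: a real bridge table would carry hash keys, the claim amount would normally live in a satellite, and all names and values here are assumptions.

```python
from collections import defaultdict

# Hypothetical link data: (customer, policy) and (policy, claim, amount).
link_customer_policy = [("C-1001", "H-1"), ("C-1001", "A-7"), ("C-1002", "A-8")]
link_policy_claim = [("A-7", "CL-9", 1200), ("A-7", "CL-10", 300), ("A-8", "CL-11", 50)]

def build_bridge(customer_policy, policy_claim, snapshot_date: str):
    """Materialize claim-grain data at customer grain for one snapshot date."""
    policy_to_customer = {policy: customer for customer, policy in customer_policy}
    totals = defaultdict(lambda: {"claim_count": 0, "claim_amount": 0})
    for policy_id, _claim_id, amount in policy_claim:
        customer = policy_to_customer[policy_id]
        totals[customer]["claim_count"] += 1
        totals[customer]["claim_amount"] += amount
    # One stored row per customer: queries no longer join and aggregate at run time.
    return [
        {"snapshot_date": snapshot_date, "customer_number": customer, **aggregate}
        for customer, aggregate in sorted(totals.items())
    ]

bridge_customer_claims = build_bridge(link_customer_policy, link_policy_claim, "2020-01-31")
for row in bridge_customer_claims:
    print(row)
```

A fact table at customer grain can then read these rows directly rather than repeating the joins on every query.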

Grain Definitions in Data Warehousing

The grain within a dimensional model is the level of detail available in each table. The grain of a fact table is thus defined by the number of related dimensions. There are three different types of granularity for fact entities within a dimensional model.