Multiple schemas (MS)

Suppose we have 2 source systems A and B. Both systems contain information regarding employees. System A contains login_name and email address, System B contains department.

In our data warehouse we would like to merge this information. My preferred way of doing this is the following.

System A has a schema called A in My_DWH.

System B has a schema called B in My_DWH.

So in this example My_DWH contains the tables A.user and B.user.
We can also create a table or view called dbo.user in My_DWH that contains a merged view of the two.

It’s important to have A.user and B.user for lineage. You need to be able to go back to the source of records in dbo.user.

• It’s better to use this pattern even when you only have System A in your data warehouse (i.e. there is only one user entity). Otherwise you will have a lot of rework when System B is added at a later time.

When using this pattern in Staging, we avoid issues when different sources have the same table names.

[MS_src_schema_id]

When, in the above example, records from schemas A and B are merged into My_DWH.dbo.user, we might get uniqueness issues with natural keys. To solve this we always add a column named src_schema_id to the natural key.

e.g. dbo.user
sur_key | src_schema_id | natural_key | synonym_id
1       | A             | 'bvdberg'   | null
2       | B             | 'bvdberg'   | 1

The combination of src_schema_id and natural_key is unique, and we can use the [Syn] pattern to de-duplicate these records in the DWH. Note that we can choose to show only sur_key 1 in the datamart (surrogate foreign keys will not point to entities that have synonyms).
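As a minimal sketch of such a merged table (data types and the char(1) schema code are illustrative, not prescribed by the pattern):

create table dbo.[user] (
    sur_key       int identity(1,1) primary key,  -- surrogate key
    src_schema_id char(1)     not null,           -- originating source schema: 'A' or 'B'
    natural_key   varchar(50) not null,           -- e.g. the login_name from the source
    synonym_id    int         null,               -- [Syn] pattern: points to the surviving duplicate
    constraint UQ_user_natural_key unique (src_schema_id, natural_key)
);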

 

Natural key [NK]

Natural keys are keys that can be used to identify entities in the domain. They must exist in the source system. For example: customer code, order number, etc.

In our data warehouse we create a new key, called a surrogate key. This key is usually an incrementing integer.

A natural key can be primary or foreign. For example, the table Customer has the following columns:

customer_id        : surrogate key
customer_code   : primary natural key
country_code     : foreign natural key

The primary natural key can be used for historization.
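A minimal sketch of this Customer example (data types are illustrative):

create table dbo.Customer (
    customer_id   int identity(1,1) primary key,  -- surrogate key, generated in the data warehouse
    customer_code varchar(20) not null unique,    -- primary natural key, taken from the source system
    country_code  char(2)     not null            -- foreign natural key, e.g. referencing a Country table
);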

Design patterns definition

“In software engineering, a design pattern is a general reusable solution to a commonly occurring problem within a given context in software design. A design pattern is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem that can be used in many different situations. Patterns are formalized best practices that the programmer can use to solve common problems when designing an application or system.” [ https://en.wikipedia.org/wiki/Software_design_pattern ]

I will give each pattern an easy-to-remember alias, e.g. Vocabulary -> Voc. Sub-patterns might be numbered, e.g. Voc-1.

Disclaimer: I learned a lot from, and borrowed a lot from, the following modeling methods:


Vocabulary [Voc]

[Voc-0] Naming convention. Use the English language. Use singular instead of plural. Objects start with a capital and words are separated with underscores (e.g. Sales_agent). Linking tables may contain multiple capitals (e.g. Sales_agent_Department). Columns are lowercase.

We use the following postfixes and prefixes:


[Voc-1] Don’t reuse domain vocabulary
One best practice is not to reuse the vocabulary of our source systems. There are several reasons for this:
• We want to be able to (re)define concepts. Using the same vocabulary is confusing.
• Source systems may change. We want to minimize our dependency on these changes.
• It’s best practice to model the domain from a business or customer perspective (top-down) instead of a source system perspective (bottom-up). Source systems focus mainly on getting things done. We want to focus on real-world facts.
• Exception: In Staging we use the naming of the source system (no renaming).

Leaving the stone age of business intelligence

 

Introduction

‘Starship Nasa16 enters orbit of the planet Kepler-452b. The captain orders his crew to gain as much intelligence as possible on the population of this planet. The engineers manage to establish a link to one of the main data centers of Kepler-452b. Now all focus shifts to the BI team who have to extract and interpret the data. While the whole crew is anxiously waiting for the BI team, the stress level in the BI team is rising to the top.
The captain listens to the desperate utterances of the BI team:

Oh no, they are using Oracle and we have Microsoft.
Something went wrong in the daily job.
The job hangs because of a few strange records in our source data.

They have no clue what the domain looks like or which entities to look for, so they are constantly making assumptions based on short fragments of data. Several times they have to start over from scratch because they made the wrong assumptions, causing a lot of frustration because work has to be thrown away. After a few weeks the captain decides to go back to earth and come back when the BI team has learned enough about the alien population.‘

I wrote this anecdote to illustrate that, in my view, we are still living in the stone age of business intelligence. Every time we build a new data warehouse to gain knowledge of a particular domain, we start from scratch building extract, transform and load code. When problems are encountered we build a solution, but solutions are rarely reused because they contain domain-specific code. In short: there is too little reuse of ETL code.

In an ideal BI environment, BI developers only work on interpreting the domain and configuring the ETL building blocks to Extract, Transform and Load the source data into an Enterprise data model. They should not be troubled by programming code to build historic tables, to look up foreign keys, to read source files, to de-normalize tables, to handle errors, logging, etc. If a new situation is encountered, the BI developer should think about designing a generic reusable building block, instead of only fixing the situation at hand.
In this article I will discuss these generic reusable building blocks, or design patterns. I also welcome your feedback, so that we can improve them or identify them as best practices.


Historization [His]

Source systems store facts that are valid at the moment you look. Right after you retrieve a dataset, it might already be outdated because some process performed an update. So we have to keep in mind that we take snapshots of a dataset that are valid at snapshot time T.

Example
According to a snapshot taken at March 11 09:11 we might calculate an active customer base at end of period (EOP) February of 10,320 customers.
The next day we calculate the exact same metric (base EOP February), but now based on a snapshot taken at March 12 09:03.
This time the active base at EOP February is 10,319. How is it possible that changes between March 11 and 12 have an impact on the base of February?
Several causes can be responsible:

  • Updates with retroactive effect: a cancellation was undone on March 11. A customer called explaining that he did not want to cancel in the first place, so the order status was incorrect. Using some dirty trick the user made sure that the status was changed from aborted into successfully delivered. When looking at the snapshot of March 12 it looks like this customer never aborted in the first place.
  • Late arriving facts: on March 11 the supplier sends the information to the source system that some user was delivered on February 29.

Solution

There are two concepts of time: the transaction (or knowledge) time, i.e. when a fact became known to us, and the valid time, i.e. when the fact was true in the real world.

Using the above example we can say that the fact "base EOP February = 10,320" was known to be true from March 11 09:11 till March 12 09:03.

The valid time for the above scenario would be never for "EOP February = 10,320" and always for "EOP February = 10,319". However, this was only known to us from March 12 09:03.

Using two notions of time we can answer questions like: why has the active base EOP February changed since the last time I looked?

Implementation

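The exact implementation can vary; as a minimal sketch, reusing the Customer example from the natural key pattern (column names such as knowledge_start_dt and knowledge_end_dt are illustrative), a historized object could look like this. When historization is applied, dbo.Customer becomes a view over his.Customer:

create table his.Customer (
    customer_id        int         not null,  -- surrogate key
    customer_code      varchar(20) not null,  -- primary natural key
    country_code       char(2)     not null,
    knowledge_start_dt datetime    not null,  -- snapshot time at which this version became known to us
    knowledge_end_dt   datetime    null,      -- snapshot time at which it was superseded (null = current)
    delete_dt          datetime    null       -- set by delete detection [His-2]
);
go

-- the "latest" view: the current version of each record, deleted records filtered out
create view dbo.Customer as
select customer_id, customer_code, country_code
from his.Customer
where knowledge_end_dt is null
  and delete_dt is null;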

Usage

Using historization for an object or not should not influence the way it is used. When we don’t apply historization dbo.<object> just points to a table with no history. In this case there is no his.<object> table.

In 99% of the cases we just want to query the latest view. But when we want to answer questions like the one above, we can query the his.<object> table. It will show the changes that happened on March 11. Usually this is enough to answer the question.

[His-1] Datamarts only contain latest data.

Depending on your domain, history might not be required in the datamart; it might only slow things down and make querying more complex. That’s why we choose not to include history in the datamart.

[His-2] Delete detection

We want to identify deleted records. Instead of deleting these records in his.<object>, we choose to set the delete_dt column. The latest view filters out these records, but when performing an investigation we can look into his.<object> to find the deleted records.
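A minimal sketch of delete detection, reusing the illustrative his.Customer columns from above and assuming the source snapshot is loaded into a staging table stg.Customer:

-- flag current records that no longer appear in the source snapshot
update h
set delete_dt = getdate()
from his.Customer h
where h.knowledge_end_dt is null
  and h.delete_dt is null
  and not exists (select 1
                  from stg.Customer s
                  where s.customer_code = h.customer_code);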

Metric value [Mv]

Applies to: datamart.

We want to store the calculations and definitions that are used by the business in our management reports in a central location, using a readable language, and not interwoven into the many queries that are part of the many reports that we produce. This is important because these calculations and definitions may change and the number of reports may grow rapidly.

The solution

We have 3 tables and a stored procedure.


Example

In the above example two contracts are sold on March 2 and one contract is aborted on March 3. If we want to see the sales in March we can just sum all values in Fct_Metric_value_stats with cal_date in March. We can easily build a drill-down report to show the individual contracts that were sold in March.

usp_update_metric_value

This procedure empties the Fct_Metric_value table and fills it using fresh calculations.

Metrics are mainly defined by queries based on the main Fact table (e.g. Fct_Contract) and Dim_Calendar.

For performance reasons we aggregate some metrics only per end of week, month, quarter and year in Fct_Metric_value_stats.
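A minimal sketch of how such a procedure could look. The Dim_Metric lookup table and columns such as sale_dt, contract_id, fct_id and end_of_month_dt are assumptions for illustration:

create procedure dbo.usp_update_metric_value
as
begin
    -- refresh the detail table with fresh calculations
    truncate table dbo.Fct_Metric_value;

    -- example metric: one row (value 1) per contract sold, on the calendar date of the sale
    insert into dbo.Fct_Metric_value (metric_id, cal_date, fct_id, value)
    select m.metric_id, c.sale_dt, c.contract_id, 1
    from dbo.Fct_Contract c
    join dbo.Dim_Metric m on m.metric_name = 'contracts_sold';

    -- pre-aggregate per end of month for reporting performance
    truncate table dbo.Fct_Metric_value_stats;
    insert into dbo.Fct_Metric_value_stats (metric_id, cal_date, value)
    select mv.metric_id, cal.end_of_month_dt, sum(mv.value)
    from dbo.Fct_Metric_value mv
    join dbo.Dim_Calendar cal on cal.cal_date = mv.cal_date
    group by mv.metric_id, cal.end_of_month_dt;
end;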

Pros

  • Performance: querying Fct_Metric_value_stats is much faster than performing the calculation during report execution.
  • Efficiency: analysts don’t have to know the definitions by heart. They can just select the right metric.
  • Data quality: when definitions change, you only have to change usp_update_metric_value. All reports use the same metric definitions. This is very useful when using drill-down reports.

Cons

  • Fct_Metric_value might become quite large. We need to keep focus on performance.

Synonyms [Syn]

Example

New customers are asked for the name of their previous provider. They can choose to select a value from the combo box, or they can enter a new value. What happens is that different provider names end up in the source, like zigo, ziggo, Zigg, Ziggo geloof ik, Upc, etc.

We want to map all these different names to one single provider named Ziggo.

Solution

When the data gets normalized in level 1 of the DWH we get a table called Brand that contains all the different provider names, e.g.

10 Zigo
11 Ziggo
12 Zigg
13 Ziggo geloof ik

The ETL automatically adds a column named synonym_id to each lookup table. synonym_id defines a parent-child relation. We have to fill this column once, manually (or using some fuzzy-logic query).

id | brand_name      | synonym_id
10 | Zigo            | 11
11 | Ziggo           |
12 | Zigg            | 11
13 | Ziggo geloof ik | 11

 

When synonym_id is filled, all lookups must use synonym_id instead of the primary key, e.g. by using this T-SQL:

select isnull(synonym_id, id) as brand_id
from dbo.Brand

 

The table work_order will look something like this (brand_name comes from staging, brand_id is looked up).

id  | register_dt | brand_id | brand_name
100 | 2015-03-01  | 11       | Zigg
101 | 2015-02-14  | 11       | Ziggo
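The brand_id lookup itself could be sketched like this (the staging table stg.work_order and its columns are assumptions):

-- resolve brand_id during the load of work_order, using the synonym-aware key
select wo.id,
       wo.register_dt,
       isnull(b.synonym_id, b.id) as brand_id,
       wo.brand_name
from stg.work_order wo
left join dbo.Brand b on b.brand_name = wo.brand_name;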

 

[Syn-1] Foreign key check

We have added a check in usp_historize to prevent the situation where synonym_id refers to a non-existing or deleted record.
This check is done for every update of a historic table that contains synonym_id. We could improve this ETL by only performing this check when records are deleted.
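A sketch of the kind of check usp_historize could perform, using the Brand table as an example (the query should return zero rows):

-- synonym_id values that point to a non-existing record
select b.id, b.synonym_id
from dbo.Brand b
where b.synonym_id is not null
  and not exists (select 1 from dbo.Brand p where p.id = b.synonym_id);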

 

Unknown date [UD]

Context

When we want to determine the active customer base we use SQL like the following (there are several equivalent ways to write it).
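A sketch, assuming Fct_Contract has billing_start_date and billing_term_date columns (a NULL termination date means billing was never terminated):

declare @ref_date date = cast(getdate() as date);

-- variant 1: explicit NULL check on the termination date
select count(*) as active_base
from dbo.Fct_Contract
where billing_start_date <= @ref_date
  and (billing_term_date is null or billing_term_date > @ref_date);

-- variant 2: the same condition written with isnull()
select count(*) as active_base
from dbo.Fct_Contract
where billing_start_date <= @ref_date
  and isnull(billing_term_date, '2079-01-01') > @ref_date;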

To make this SQL easier we might replace the NULL values of the column billing_term_date in Fct_Contract by a constant max date (e.g. 2079-01-01). This constant should be interpreted as "billing was not terminated" (equivalent to a NULL value).

Pros:

  • We can define this max date once in the ETL and not in every query that calculates the active customer base.
  • We don’t have to check for NULL values (this is also a performance improvement). We might as well set the billing_term_date column to not allow NULL values.
  • The SQL gets simpler (see the sketch after this list).

  • This pattern might be used in an architecture where we don’t allow NULL values in our datamart (especially for foreign keys), just because it makes our querying easier and faster (see the UFK pattern).
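With the constant max date in place, the NULL handling disappears and the active-base query from above reduces to a plain comparison:

declare @ref_date date = cast(getdate() as date);

select count(*) as active_base
from dbo.Fct_Contract
where billing_start_date <= @ref_date
  and billing_term_date > @ref_date;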

Cons:

  • We cannot use the query part where billing_term_date is not null to find terminated customers. Instead we could use one of the alternatives sketched after this list.

  • It feels a bit like a dirty trick, because you are tweaking the data to make your querying easier. Currently I’m not using this pattern.
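Possible alternatives for finding terminated customers when the constant max date is used (a sketch):

-- terminated contracts: the termination date is a real date, not the max-date constant
select *
from dbo.Fct_Contract
where billing_term_date <> '2079-01-01';

-- or, equivalently, anything terminated before the max-date constant
select *
from dbo.Fct_Contract
where billing_term_date < '2079-01-01';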

 

Unknown foreign key [UFK]

We prefer to minimize NULL values for foreign keys, because:

  • Allowing NULL values makes queries more complex. We don’t want SQL like: where user_id in (select user_id from X where user_id is not null)
  • We don’t want NULL values in report combo boxes, e.g. user: A, B, NULL
  • In SSIS we dislike expressions like: !ISNULL(User_Id) && User_Id == 32
  • We prefer INNER JOIN over LEFT JOIN because it’s faster. When a foreign key can be NULL we have to use a LEFT JOIN, because otherwise we would be filtering out rows.

Solution

We replace NULL values by an "unknown" value. E.g. when a user is unknown we set the user_id to -1, which points to a user with the name unknown. This way we can always inner join and still show unknown values in reports.
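A minimal sketch of the pattern, assuming a Dim_User dimension with an identity surrogate key and a staging table stg.Contract (all names are illustrative):

-- seed an explicit 'unknown' member once, with surrogate key -1
set identity_insert dbo.Dim_User on;
insert into dbo.Dim_User (user_id, user_name) values (-1, 'unknown');
set identity_insert dbo.Dim_User off;

-- during the fact load, fall back to the unknown member instead of NULL
select c.contract_id,
       isnull(u.user_id, -1) as user_id
from stg.Contract c
left join dbo.Dim_User u on u.login_name = c.login_name;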

Notes

NULL values are useful for facts, e.g. download_speed. When download_speed is NULL it means that there is no download speed defined (e.g. for sales propositions that do not include internet). This is different from having a download speed of 0.