Only transfer data that has changed

Let's be honest: how often do you transfer data from A to B because something might have changed, when in fact nothing has changed?

Because simple copy tasks don't take much time, it's often not considered important to write code that checks whether the source data has changed. However, when working with a generic datawarehouse it's relatively easy to implement the check 'has the source changed?' and, if not, do nothing.

Implementation:

• An indexed last_update_dt column in the source.
• Compare this column with the meta-data last_update_dt stored for this specific transfer (source, target combination). It can also be done with 100% meta-data.
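A minimal T-SQL sketch of this check, assuming a hypothetical meta-data table meta.Transfer that stores the last processed last_update_dt per source/target combination:

declare @last_processed datetime;

select @last_processed = last_update_dt
from meta.Transfer                   -- hypothetical meta-data table
where src_name = 'system_A' and target_name = 'dwh';

if exists (select 1
           from src.Some_table       -- last_update_dt is indexed
           where last_update_dt > @last_processed)
begin
    -- ... perform the actual transfer here ...

    update meta.Transfer
    set last_update_dt = (select max(last_update_dt) from src.Some_table)
    where src_name = 'system_A' and target_name = 'dwh';
end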


Multiple schemas [MS]

Suppose we have 2 source systems A and B. Both systems contain information regarding employees. System A contains login_name and email address, System B contains department.

In our datawarehouse we like to merge this information. My preferred way of doing this is the following.

System A has a schema called A in My_DWH.

System B has a schema called B in My_DWH.

So in this example My_DWH contains the table A.user and B.user.
We can also create a table or view called dbo.user in My_DWH that contains a merged view of the two.

It’s important to have A.user and B.user for lineage. You need to be able to go back to the source of records in dbo.user.

* It's better to use this pattern even when you only have System A in your data warehouse (i.e. there is only 1 user entity). Otherwise you will have a lot of rework to do when System B is added at a later time.

When using this pattern in Staging, we avoid issues when different sources have the same table names.
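A sketch of the merged object, assuming materialization as a view and illustrative column names:

-- A.user and B.user keep the source naming; dbo.user merges them
-- and records which schema each row came from.
create view dbo.[user] as
select 'A' as src_schema_id, login_name, email_address, null as department
from A.[user]
union all
select 'B' as src_schema_id, login_name, null, department
from B.[user];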

[MS-1] src_schema_id

When, in the above example, records from schemas A and B are merged into My_DWH.dbo.user, we might get uniqueness issues with natural keys. To solve these issues we always add a column named src_schema_id to the natural key.

e.g. dbo.user

sur_key  src_schema_id  natural_key  synonym_id
1        A              'bvdberg'    null
2        B              'bvdberg'    1

The combination of src_schema_id and natural_key is unique, and we can use the [Syn] pattern to de-duplicate these records in the DWH. Note that we can choose to only show sur_key 1 in the datamart (surrogate foreign keys will not point to entities that have synonyms).
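If dbo.user is materialized as a table, the key can be enforced like this sketch:

-- The natural key is only unique in combination with its source schema.
alter table dbo.[user]
add constraint uq_user_nk unique (src_schema_id, natural_key);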


Natural key [NK]

Natural keys are keys that can be used to identify entities in the domain. They must exist in the source system. For example: customer code, order number, etc.

In our datawarehouse we create a new key, called a surrogate key. This key is usually an incrementing integer.

A natural key can be primary or foreign. For example, the table Customer has the following columns:

customer_id        : surrogate key
customer_code   : primary natural key
country_code     : foreign natural key

The primary natural key can be used for historization.
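As a sketch, the Customer table could be declared like this (data types are assumptions):

create table dbo.Customer (
    customer_id   int identity(1,1) primary key, -- surrogate key, generated by the DWH
    customer_code varchar(20) not null,          -- primary natural key, exists in the source
    country_code  char(2)     not null           -- foreign natural key
);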

Design patterns definition

“In software engineering, a design pattern is a general reusable solution to a commonly occurring problem within a given context in software design. A design pattern is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem that can be used in many different situations. Patterns are formalized best practices that the programmer can use to solve common problems when designing an application or system.” [ https://en.wikipedia.org/wiki/Software_design_pattern ]

I will give each pattern an easy-to-remember alias, e.g. Vocabulary -> VOC. Sub-patterns might be numbered, e.g. VOC-1.

Disclaimer: I learned a lot from, and used a lot of, existing modeling methods.

Vocabulary [Voc]

[Voc-0] Naming convention. Use English language. Use singular instead of plural. Objects start with a capital and words are separated with underscores (e.g. Sales_agent). Linking tables may have multiple capitals (e.g. Sales_agent_Department). Columns are lowercase.

We use the following prefixes and postfixes:

Prefix   Meaning            Example
Fct_     fact table         Fct_Contract
Dim_     dimension table    Dim_Calendar
usp_     stored procedure   usp_update_metric_value

Postfix  Meaning                     Example
_id      surrogate or foreign key    brand_id
_dt      date/datetime column        register_dt

[Voc-1] Don't reuse domain vocabulary
One best practice is not to reuse the vocabulary of our source systems. There are several reasons for this:
• We want to be able to (re)define concepts. Using the same vocabulary is confusing.
• Source systems may change. We want to minimize our dependency on these changes.
• It's best practice to model the domain from a business or customer perspective (top-down) instead of a source system perspective (bottom-up). Source systems focus mainly on getting things done. We want to focus on real-world facts.
• Exception: In Staging we use the naming of the source system (no renaming).

Historization [His]

Source systems store facts that are valid at the moment you look. Right after you retrieve a dataset, it might already be outdated because some process performed an update. So we have to keep in mind that we take snapshots of a dataset that are valid at snapshot time T.

Example
According to a snapshot taken at March 11 09:11 we might calculate an active customer base at end of period (EOP) February of 10,320 customers.
The next day we calculate the exact same metric (base EOP February), but now based on a snapshot taken at March 12 09:03.
This time the active base at EOP February is 10,319. How is it possible that changes between March 11 and 12 have an impact on the base of February?
Several causes can be responsible:

  • Updates with backwards effect: a cancellation was undone on March 11. A customer called explaining that he did not want to cancel in the first place, so the order status was incorrect. Using some dirty trick the user made sure that the status was changed from aborted into successfully delivered. When looking at the snapshot of March 12 it looks like this customer never aborted at all.
  • Late arriving facts: on March 11 the supplier sends the information to the source system that some user was delivered on February 29.

Solution

Two concepts of time:

Using the above example we can say that the fact 'base EOP February = 10,320' was true, in transaction time, from March 11 09:11 till March 12 09:03.

The valid time for the above scenario would be 'never' for EOP February = 10,320 and 'always' for EOP February = 10,319. However, this was known to us only at March 12 09:03.

Using these two notions of time we can answer questions like: why has the active base EOP February changed since the last time I looked?

Implementation

In short: each historized object gets a his.<object> table that keeps every version of a record together with transaction-time columns (when the version became known to us, and delete_dt for deletions), plus a dbo.<object> view that exposes only the latest non-deleted version.
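A minimal sketch of the pattern (column and table names are assumptions, except delete_dt):

-- his.Customer keeps every version of every record we have seen.
create table his.Customer (
    customer_id   int         not null,
    customer_code varchar(20) not null,
    country_code  char(2)     not null,
    insert_dt     datetime    not null, -- transaction time: when we learned this version
    delete_dt     datetime    null      -- set by delete detection [His-2]
);
go
-- dbo.Customer exposes only the latest, non-deleted version.
create view dbo.Customer as
select customer_id, customer_code, country_code
from his.Customer h
where h.delete_dt is null
  and h.insert_dt = (select max(insert_dt)
                     from his.Customer
                     where customer_id = h.customer_id);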

Usage

Whether or not we use historization for an object should not influence the way it is used. When we don't apply historization, dbo.<object> just points to a table with no history. In this case there is no his.<object> table.

In 99% of the cases we just want to query the latest view. But when we want to answer questions like the above, we can query the his.<object> table. It will show the changes that happened on March 11. Usually this is enough to answer the question.

[His-1] Datamarts only contain latest data.

Depending on your domain, history might not be required in the datamart; it might only slow things down and make querying more complex. That's why we choose not to include history in the datamart.

[His-2] Delete detection

We want to identify deleted records. Instead of deleting these records in his.<object> we choose to set the delete_dt column. The latest view filters out these records, but when performing an investigation we can look into his.<object> to find deleted records.
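A sketch of the detection step, assuming the current source extract lives in a staging table stg.Customer:

-- Mark records that disappeared from the source as deleted.
update h
set delete_dt = getdate()
from his.Customer h
where h.delete_dt is null
  and not exists (select 1
                  from stg.Customer s
                  where s.customer_code = h.customer_code);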

Metric value [Mv]

Applies to: datamart.

We want to store the calculations and definitions that are used by the business in our management reports in a central location, using a readable language, and not interwoven into the many queries that are part of the many reports that we produce. This is important because these calculations and definitions may change and the number of reports may grow rapidly.

The solution

We have 3 tables and a stored procedure. The ones that appear below are Fct_Metric_value, Fct_Metric_value_stats and the stored procedure usp_update_metric_value.

Example

In this example two contracts are sold on March 02 and one contract is aborted on March 03. If we want to see the sales in March we can just sum all values in Fct_Metric_value_stats with cal_date in March. We can easily build a drill-down report to show the individual contracts that were sold in March.
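Summing the March sales could then look like this sketch (metric and column names are assumptions):

select sum(mv.value) as contracts_sold_march
from dbo.Fct_Metric_value_stats mv
join dbo.Dim_Calendar c on c.cal_date = mv.cal_date
where mv.metric_name = 'contracts_sold'
  and c.cal_year = 2015
  and c.cal_month = 3;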

usp_update_metric_value

This procedure empties the Fct_Metric_value table and fills it using fresh calculations.

Metrics are mainly defined by queries based on the main Fact table (e.g. Fct_Contract) and Dim_Calendar.

For performance reasons we aggregate some metrics only per end of week, month, quarter and year in Fct_Metric_value_stats.
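A rough sketch of the procedure, with an illustrative metric definition:

create procedure dbo.usp_update_metric_value
as
begin
    truncate table dbo.Fct_Metric_value;

    -- One insert per metric definition, e.g. contracts sold per day.
    insert into dbo.Fct_Metric_value (metric_name, cal_date, contract_id, value)
    select 'contracts_sold', c.sold_date, c.contract_id, 1
    from dbo.Fct_Contract c;

    -- Aggregate per end of week, month, quarter and year
    -- into Fct_Metric_value_stats (omitted here).
end;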

Pros

  • Performance: querying Fct_Metric_value_stats is much faster than performing the calculation during report execution.
  • Efficiency: analysts don't have to know the definitions by heart. They can just select the right metric.
  • Data quality: when definitions change, you only have to change usp_update_metric_value. All reports use the same metric definitions. This is very useful when using drill-down reports.

Cons

  • Fct_Metric_value might become quite large. We need to keep focus on performance.

Synonyms [Syn]

Example

New customers are asked for the name of their previous provider. They can choose to select a value from the combo-box, or they can enter a new value. The result is that different provider names exist in the source, like zigo, ziggo, Zigg, Ziggo geloof ik (Dutch for 'Ziggo, I believe'), Upc, etc.

We want to map all these different names to one single provider named Ziggo.

Solution

When the data gets normalized in level 1 of the DWH we get a table called Brand that contains all the different provider names, e.g.:

10 Zigo
11 Ziggo
12 Zigg
13 Ziggo geloof ik

The ETL automatically adds a column named synonym_id to each lookup table. synonym_id defines a parent-child relation. We have to fill this column once manually (or using some fuzzy logic query).

id  brand_name       synonym_id
10  Zigo             11
11  Ziggo
12  Zigg             11
13  Ziggo geloof ik  11


When synonym_id is filled, all lookups must use synonym_id instead of the primary key, e.g. by using this T-SQL:

-- Resolve a brand to its synonym parent when one exists,
-- otherwise keep the brand's own id.
select isnull(synonym_id, brand_id) as brand_id
from dbo.Brand


The table work_order will look something like this (brand_name comes from staging, brand_id is looked up):

id   register_dt  brand_id  brand_name
100  2015-03-01   11        Zigg
101  2015-02-14   11        Ziggo


[Syn-1] Foreign key check

We have added a check in usp_historize to prevent the situation where synonym_id refers to a non-existing or deleted record.
This check is done for every update of a historized table that contains synonym_id. We could improve this ETL by only performing this check when records are deleted.
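The check could be a query like this sketch (dbo.Brand, as the latest view, already filters out deleted records):

-- Fail the load when a synonym points to a missing or deleted brand.
if exists (select 1
           from dbo.Brand b
           left join dbo.Brand parent on parent.brand_id = b.synonym_id
           where b.synonym_id is not null
             and parent.brand_id is null)
    raiserror('synonym_id refers to a non-existing or deleted record', 16, 1);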


Unknown date [UD]

Context

When we want to determine the active customer base we use SQL along the lines of the sketch below (billing_start_date is an assumed column name):
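declare @report_date date = '2015-02-28';

-- Active: billing started on or before the report date
-- and was not (yet) terminated at that date.
select count(*) as active_base
from dbo.Fct_Contract
where billing_start_date <= @report_date
  and (billing_term_date is null or billing_term_date > @report_date);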

To make this SQL easier we might replace the NULL values for the column billing_term_date in Fct_Contract by a constant max date (e.g. 2079-01-01). This constant should be interpreted as 'billing was not terminated' (equal to a NULL value).

Pros:

  • We can define this max date once in the ETL and not in every query that calculates the active customer base.
  • We don't have to check for NULL values (this is also a performance improvement). We might as well set the billing_term_date column to not allow NULL values.
  • The SQL gets simpler (see the sketch after this list).

  • This pattern might be used in an architecture where we don't allow NULL values in our datamart (especially for foreign keys), just because it makes our querying easier and faster (see the UFK pattern).
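With the constant max date in place, the NULL check disappears (same assumed columns as above):

declare @report_date date = '2015-02-28';

select count(*) as active_base
from dbo.Fct_Contract
where billing_start_date <= @report_date
  and billing_term_date > @report_date;  -- 2079-01-01 means 'not terminated'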

Cons:

  • We cannot use the query part where billing_term_date is not null to find terminated customers. Instead we have to compare against the constant max date (see the sketch after this list).

  • It feels a bit like a dirty trick, because you are tweaking the data to make your querying easier. Currently I’m not using this pattern.
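Finding terminated customers then becomes a comparison against the constant, as in this sketch:

select count(*) as terminated_base
from dbo.Fct_Contract
where billing_term_date < '2079-01-01';   -- or: billing_term_date <> '2079-01-01'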