How to Deploy Slowly Changing Dimensions?
In the previous blog, I had explained the different types of Slowly Changing Dimensions (SCD) that can be used in dimensional modeling for tracking changes. In this blog, I shall be focusing on the Type 2 form of SCD (Slowly Changing Dimensions) and the issues associated with its implementation.
Type 2 Slowly Changing Dimensions (SCD) Implementation Recap
Before that, using an example, I would like to do a quick recap on Type 2 Slowly Changing Dimensions (SCD) implementation. While building star schema’s in a data warehouse, the dimensions tables are joined with the fact tables. To track the changes in a dimension, the Type 2 technique of Slowly Changing Dimensions (SCD) is used. In this type when values for the current dimension record change, the current record is marked as “closed”, and the new record gets inserted.
So, for example, if Mike moves from Los Angeles to New York, then there will be 2 records associated with Mike in the updated dimension table. The latest record will be marked as open/active while the old record will be marked as inactive/closed.
Original Record
Customer ID | Master Customer ID | Name | City | IsActive |
1111 | 1111 | Mike | Los Angeles | 1 |
2222 | 2222 | Jane | San Francisco | 1 |
Updated Record
Customer ID | Master Customer ID | Name | City | IsActive |
1111 | 3333 | Mike | Los Angeles | 0 |
2222 | 2222 | Jane | San Francisco | 1 |
3333 | 3333 | Mike | New York | 1 |
There are a few challenges one is likely to face while fetching the latest information related to a customer. Let’s take the example of Mike and Jane which I had talked of in the first part while explaining a Type 2 SCD (slowly changing dimensions type 2), further.
Let’s suppose both Mike and Jane have one order containing one product when they were living in Los Angeles and San Francisco, respectively. After moving to New York, Mike updated his city information on the e-commerce website but did not purchase any product.
Customer ID | Order ID | Product ID | Order Amount |
1111 | ABCD | 123 | 599 |
2222 | EFGH | 1234 | 499 |
1111 | IJKL | 12 | 201 |
In the above scenario, if all we care about is Mike and Jane’s Master Customer ID and associated sale, then we can simply join the customer table with the sales table using customer id and roll the information by Master Customer ID. The query should look like below
Select cu.Master Customer ID,
sum(sl.Order Amount) as Total Amount
From Customer cu
Join Sales sl on cu.Customer ID = sl.Customer ID
Group by cu.Master Customer ID
The result would look like below.
Master Customer ID | Order Amount |
3333 | 800 |
2222 | 400 |
However, if we need some more recent information about the customer then the above method will not work. There are a few ways by which you can achieve this.
Achieve business success with our data analytics consulting services
Joining the customer table twice
In this method, there is another join needed with the customer table in the above query so that the customers’ latest information is selected and displayed in the result. Modified query is copied below:
Select cu.Master Customer ID,
Mast.Name,
Mast.City,
sum(Order Amount) as Total Amount
From Customer cu
Join Sales sl on cu.Customer ID = sl.Customer ID
Join Customer Mast on cu.Master Customer ID = Mast.Customer ID and Mast.IsActive = 1
Group by cu.Master Customer ID,
Mast.Name,
Mast.City
Although the above method is straightforward, it increases the cost of the query, especially if the number of columns in the customer table is more. This results in slow query performance.
The above query still gives faster results for columnar databases such as Redshift, Actian, MonetDB. But row-based databases such as Oracle or Teradata really struggle in producing the output quickly. To mitigate this, 2 views can be created. One will have only 2 columns containing Master Customer ID and related customer IDs. The other view will have a snapshot of the latest customer information frequently needed. Two customer joins mentioned in the above query will be replaced by these 2 views.
Using Analytic Functions:
We can get similar results by using Analytic functions such as First_value. Modified query is copied below:
Select cu.Master Customer ID,
cu.Name,
FIRST_VALUE(cu.City) OVER (PARTITION BY Master Customer ID ORDER BY Customer ID DESC ROWS UNBOUNDED PRECEDING) AS LatestCity,
sum(NVL(Order Amount,0)) as Total Amount
From Customer cu
Left Join Sales sl on cu.Customer ID = sl.Customer ID
Group by cu.Master Customer ID,
cu.Name,
LatestCity
;
The result is copied below:
Master Customer ID | Name | City | Order Amount |
3333 | Mike | New York | 800 |
2222 | Jane | San Francisco | 400 |
The above query should produce similar results.
There are some drawbacks to the above method. We have to create a left join between Customer & Sales. If we create an inner join, then the results will be incorrect. Apart from that, the left join will create an additional row for the records which do not have associated sales information.
E.g.: In the case of Mike, a record with null order information will be created for Customer ID 3333 since that ID does not have any corresponding sales information. This can be tackled by adding the NVL function to replace the null order amount with 0. If we do not want to aggregate then we can write a nested query where we can add a not null condition in where clause to remove the null record associated with blank order id.
In conclusion: It is clear that although SCDs type 2 (slowly changing dimensions type 2) are used to track historical changes one needs to be careful while writing queries & fetching the correct information.
An Engine That Drives Customer Intelligence
Oyster is not just a customer data platform (CDP). It is the world’s first customer insights platform (CIP). Why? At its core is your customer. Oyster is a “data unifying software.”
Liked This Article?
Gain more insights, case studies, information on our product, customer data platform
No comments yet.