Slowly Changing Dimensions

I mucked up the CFO's report.

A huge customer moved to another state. Like a good minion, I updated the customer's address in the data warehouse.

When the annual report went out, the CFO asked why the historical totals by state were different. 😰

Answer: The data warehouse had no record of the customer's previous address. My change overwrote the old address.

That's when I felt the pain of poor data modeling. My data warehouse was not designed to reflect what the company needed, like customers' current address AND their previous addresses.

That's also when I came across "Slowly Changing Dimension" tables.

Save your butt from the CFO's fire. Use Slowly Changing Dimensions (SCDs) in your data warehouse.

Background

In the world of data modeling, the Kimball Data Model features a fact table linked to several dimension tables.

Records in a fact table describe events. Rows may represent a transaction on amazon.com or someone checking into the hospital.

Here's a sample fact table of medical encounters at St. Mungo's Hospital for Magical Maladies and Injuries:

fct_encounter

id patient_id encounter_date diagnosis_code diagnosis_desc treatment_code treatment_desc total_cost
E10001 P001 1997-05-15 SM-101 Basilisk Venom Exposure SP-17 Essence of Phoenix Feather 1200
E10002 P001 1996-02-12 SM-205 Broken Wand Backfire Injury SP-03 Wand-Core Realignment 300
E10003 P002 1995-07-21 SM-330 Quidditch Bludger Head Trauma SP-21 Cranial Healing Charm 450

Each record represents what happened to a patient, like the diagnosis, treatment, and cost.

Dimension tables, on the other hand, describe things. Records in a dimension table may represent customers, products, or locations.

Here's a dimension table of patients at St. Mungo's:

dim_patient

id first_name last_name date_of_birth address
P001 Harry Potter 1980-07-31 4 Privet Drive, Little Whinging
P002 Ron Weasley 1980-03-01 The Burrow, Ottery St Catchpole
P003 Hermione Granger 1979-09-19 34 Tennyson Road, London

This dimension table lets us to look up patients by their IDs. We see patients' names, birthdates, addresses, etc.

Fact tables are joined to dimension tables by keys. For example, joining these two tables on fct_encounter.patient_id = dim_patient.id shows us that us Harry (P001) was hit with basilisk venom and a wand backfire; Ron (P002) experienced a bludger blow.

But what makes a dimension table "slowly changing?" In general, records in dimension tables should not change. It's uncommon for someone to change their name and near impossible to change their birthdate. But every now and then, data in dimension tables need to be updated... hence, "slowly changing."

SCDs have different types, or frameworks, for how to update the dimension table with such changes. These frameworks maintain the current value of each attribute (i.e. each column in a table), and some give a way to keep historical values.

SCD Type 1

In Type 1 SCDs, new info replaces old info. Every time a change occurs, the new value overwrites the old one. A historical version of the record is not kept. I had a Type 1 SCD when I foolishly updated the address for that customer.

For example, Harry lives with his aunt and uncle at 4 Privet Drive. But he later inherits his godfather's house at 12 Grimmauld Place. A Type 1 dim_patient table is updated as such:

Before

dim_patient

id first_name address
P001 Harry 4 Privet Drive
P002 Ron The Burrow
P003 Hermione 34 Tennyson Road

After

dim_patient

id first_name address
P001 Harry 12 Grimmauld Place
P002 Ron The Burrow
P003 Hermione 34 Tennyson Road

This approach gives the current state of the dimension. But you lose historical records. And when the CFO inevitably asks for a report about the past, you're cooked. You don't know where Harry lived before 12 Grimmauld Place.

That said, there is a place for Type 1 SCDs. Type 1 tables are useful when you don't care about historical analysis. Sometimes only the current value matters, like when you need contact info for customers.

SCD Type 2

Type 2 SCDs are the next level. When a change occurs, rather than updating the existing record, you insert a new record:

Before

dim_patient

id first_name address
P001 Harry 4 Privet Drive
P002 Ron The Burrow
P003 Hermione 34 Tennyson Road

After

dim_patient

id first_name address
P001 Harry 4 Privet Drive
P001 Harry 12 Grimmauld Place
P002 Ron The Burrow
P003 Hermione 34 Tennyson Road

Great. Now there are two records for Harry. But when joining the dimension table to fact tables... which record of Harry should you use? It's hard to tell.

That's why Type 2 SCDs have helper columns to identify when each version of a row is active. For instance, the start_date and end_date columns outline the window of time each record is effective. If Harry's address changes on 1996-06-30, dim_patient looks like this:

dim_patient

id start_date end_date first_name address
P001 1981-10-31 1996-06-30 Harry 4 Privet Drive
P001 1996-07-01 9999-12-31 Harry 12 Grimmauld Place
P002 1990-01-01 9999-12-31 Ron The Burrow
P003 1992-07-01 9999-12-31 Hermione 34 Tennyson Road

Here, the first record was active from 1981-10-31 to 1996-06-30; the values of start_date and end_date tell us that. The second record is active from 1996-07-01 to... 9999-12-31?

This "high end date" is a convention used to indicate which record is current. It's useful when you're filtering records with SQL's BETWEEN statement. You may also see NULL values for the end_date of the active record.

When it's time to join the fact and dimension tables, our query may look like this:

SELECT *
FROM fct_encounter e
JOIN dim_patient p
    ON e.patient_id = p.id
    AND e.encounter_date BETWEEN p.start_date AND p.end_date;

We pick the proper patient record (i.e. the proper record for Harry) in the join by making sure the encounter_date in the fact table is between the start_date and end_date in the dimension table. This effectively captures the dimension record that was active when the event in the fact table occurred.

There are other ways to indicate which record is current:

  • Set up flag columns like is_active with boolean values to indicate which rows are current.
id effective_date is_active first_name address
P001 1981-10-31 FALSE Harry 4 Privet Drive
P001 1996-07-01 TRUE Harry 12 Grimmauld Place
  • Add a version field with the most up-to-date record having the highest value.
id effective_date version first_name address
P001 1981-10-31 1 Harry 4 Privet Drive
P001 1996-07-01 2 Harry 12 Grimmauld Place

The methods for identifying current records go on and on.

The beauty of Type 2 SCDs is that each version of a record get its own row. This keeps a full history of changes, which enables historical analysis. The downside is that multiple rows represent the same thing (e.g. two rows both represent the same Harry), so you need to carefully filter the table for the version you want.

SCD Type 3

Spoiler alert: I don't like Type 3. Here it goes: When a change occurs, Type 3 tables do not create a new row. Instead new fields are added to mark the old value and the new value.

The first time an address changes, the column address is renamed to original_address. Then new columns current_address and effective_date are added. The original and current values appear in separate columns while effective_date indicates when the current address took effect.

This approach "bakes" historical values into the row itself:

Before

dim_patient

id first_name address
P001 Harry 4 Privet Drive

After

dim_patient

id first_name original_address current_address effective_date
P001 Harry 4 Privet Drive 12 Grimmauld Place 1996-07-01

However, there are two cons:

  • Type 3 SCDs track limited history. When Mr. Potter moves again, you'll update current_address and lose track of how he once lived at Grimmauld Place. You can only keep track of "current" and "original" values.
  • The field count can explode. This table tracks changes to the address, so you have original_address and current_address. But what about other fields? Should you add current/original pairs for last_name, phone_number, marital_status? The table can quickly become too wide to be manageable.

For these reasons, Type 3 SCDs are useful only to track one prior version of a value. It should not be used for full historical tracking.

SCD Type 4

Type 4 is where we part the waters. This approach uses two separate tables for the dimension: a current table and a historical table.

Unsurprisingly, the current table has only the current version of each record. The historical table has all versions with a timestamp of when each record became active.

When a change occurs, the record in the current table is updated, and a new record is inserted in the historical table.

Mr. Potter's address change can be modeled in tables dim_patient and dim_patient_hist as such:

Before

dim_patient

id first_name address
P001 Harry 4 Privet Drive

dim_patient_hist

id first_name address effective_date
P001 Harry 4 Privet Drive 1981-10-31

After

dim_patient

id first_name address
P001 Harry 12 Grimmauld Place

dim_patient_hist

id first_name address effective_date
P001 Harry 4 Privet Drive 1981-10-31
P001 Harry 12 Grimmauld Place 1996-07-01

Originally, Harry's record in dim_patient and dim_patient_hist are basically the same. But once he inherits the new house, his address is updated in dim_patient, and a new record is inserted in dim_patient_hist with the effective date.

Type 4 SCDs keep the current dimension table small and fast for querying. You don't have to worry about multiple rows representing the same entity, like in Type 2 (e.g. there's only one row for Harry in Type 4 dim_patient). And for those few times you need to perform historical analysis, you use the historical table.


SCDs continue to evolve to Types 5 and 6. But I've never seen them used in practice. Check them out if you're curious. Most of the time, engineers are happy with Type 1 and Type 2 SCDs.

Again, Slowly Changing Dimension tables describe "things." They're designed to reflect current values while potentially giving access to historical values. Here's the cheatsheet of how each type handles data change:

SCD Type Update Description
1 New record replaces original record. Old record is lost.
2 New record added to table. Other columns identify when records are in effect.
3 Original record is modified. Extra column gives old value.
4 Current dimension table is updated. New record is added to historical table.

Your data warehouse seems perfect on day 1. But the business will change. The unexpected will happen. Building well-designed tables will future-proof your warehouse and spare you some headache.

Are stakeholders knocking on your door about data accuracy? Call me if you need a data model 😘.