0
0
dbtdata~15 mins

Snapshot tables for historical tracking in dbt - Deep Dive

Choose your learning style9 modes available
Overview - Snapshot tables for historical tracking
What is it?
Snapshot tables are special tables that keep a history of changes in your data over time. Instead of just showing the latest data, they store every version of a record when it changes. This helps you track how data looked at different points in the past. In dbt, snapshot tables automate this process by capturing changes during each run.
Why it matters
Without snapshot tables, you lose the story of how your data evolved. This makes it hard to analyze trends, audit changes, or fix mistakes. Snapshot tables let you answer questions like 'What was the status last month?' or 'When did this value change?'. They make your data trustworthy and useful for historical analysis.
Where it fits
Before learning snapshot tables, you should understand basic SQL and dbt models. After mastering snapshots, you can explore advanced data versioning, slowly changing dimensions, and time travel queries in data warehouses.
Mental Model
Core Idea
Snapshot tables capture and store every change in your data over time, creating a timeline of historical records.
Think of it like...
Imagine taking a photo of your desk every day. Each photo shows how your desk looked that day, even if you moved things around later. Snapshot tables are like those daily photos for your data.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Original Data │──────▶│ Snapshot Run 1│──────▶│ Snapshot Table│
│ (Current)    │       │ (Detects diff)│       │ (Stores state)│
└───────────────┘       └───────────────┘       └───────────────┘

Each run compares current data to last snapshot and stores changes.
Build-Up - 7 Steps
1
FoundationUnderstanding data changes over time
🤔
Concept: Data changes and why tracking history matters.
Data in databases often changes: prices update, statuses shift, or user info edits. Without saving old versions, you only see the latest state. This loses valuable history needed for analysis, audits, or debugging.
Result
You realize that just current data is not enough for many real-world questions.
Understanding that data evolves helps you see why keeping history is important.
2
FoundationWhat snapshot tables do
🤔
Concept: Snapshot tables store every version of a record when it changes.
A snapshot table keeps a full history by saving a new row each time a record changes. It adds metadata like when the change happened and if the record is current or old.
Result
You get a table that shows all past states, not just the latest.
Knowing snapshots save changes over time is key to historical tracking.
3
IntermediateHow dbt snapshot works
🤔
Concept: dbt automates snapshot creation and change detection.
In dbt, you define a snapshot with a unique key and columns to track. Each run compares current data to the last snapshot. If differences appear, dbt inserts a new row with updated info and timestamps.
Result
You have an automated process that keeps your snapshot table up to date.
Seeing dbt handle change detection removes manual work and errors.
4
IntermediateTypes of snapshot strategies
🤔Before reading on: do you think dbt snapshots only detect changes by comparing all columns or can it track specific columns? Commit to your answer.
Concept: dbt supports different strategies to detect changes: 'check' and 'timestamp'.
The 'check' strategy compares specified columns for any change. The 'timestamp' strategy uses a column that records last update time to detect changes. Choosing the right strategy depends on your data and update patterns.
Result
You can tailor snapshots to your data's nature for efficient tracking.
Knowing multiple strategies helps you optimize snapshot performance and accuracy.
5
IntermediateSnapshot table schema and metadata
🤔
Concept: Snapshot tables include special columns to track history.
Besides your data columns, snapshot tables have columns like 'dbt_valid_from' and 'dbt_valid_to' to mark when a record version was active. A 'dbt_is_current' flag shows if the row is the latest version.
Result
You get a clear timeline and current state indicator for each record.
Understanding metadata columns is essential to query snapshots correctly.
6
AdvancedHandling slowly changing dimensions with snapshots
🤔Before reading on: do you think snapshot tables can replace all types of slowly changing dimensions or only some? Commit to your answer.
Concept: Snapshots can implement type 2 slowly changing dimensions by storing full history of changes.
Slowly changing dimensions track attribute changes over time. Snapshots store each change as a new row, preserving history. This supports analyses that need past attribute values, like customer status changes.
Result
You can manage complex historical data scenarios with snapshots.
Knowing snapshots support type 2 slowly changing dimensions expands their practical use.
7
ExpertPerformance and storage considerations
🤔Before reading on: do you think snapshot tables grow indefinitely without limits or can they be managed? Commit to your answer.
Concept: Snapshot tables can grow large and impact performance; managing them requires strategies.
Because snapshots store every change, tables can become very large. Techniques like partitioning, archiving old data, or filtering snapshots help manage size and query speed. dbt snapshots also support incremental builds to optimize runs.
Result
You maintain efficient, scalable snapshot tables in production.
Understanding growth and management prevents performance bottlenecks in real systems.
Under the Hood
dbt snapshots run SQL queries that compare current source data to the last snapshot state using keys and tracked columns. When differences are found, dbt inserts new rows with updated data and timestamps. It uses metadata columns to mark validity periods and current versions. This process happens inside your data warehouse, leveraging SQL for change detection and storage.
Why designed this way?
Snapshots were designed to automate historical tracking without complex ETL pipelines. Using SQL and metadata columns fits well with data warehouse architectures. The design balances simplicity, auditability, and performance. Alternatives like full change data capture require more infrastructure and complexity, which dbt snapshots avoid.
┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Source Table  │──────▶│ dbt Snapshot  │──────▶│ Snapshot Table│
│ (Current data)│       │ (Compare data)│       │ (Store history)│
└───────────────┘       └───────────────┘       └───────────────┘
       │                      │                       │
       │                      │                       │
       ▼                      ▼                       ▼
  New data arrives      Detect changes          Insert new rows
  in source table      by key and columns     with timestamps and flags
Myth Busters - 4 Common Misconceptions
Quick: Do snapshot tables automatically delete old versions to save space? Commit yes or no.
Common Belief:Snapshot tables keep only the latest version of each record to save storage.
Tap to reveal reality
Reality:Snapshot tables store every version of a record that changed, growing over time unless managed.
Why it matters:Assuming snapshots prune old data leads to unexpected storage bloat and slower queries.
Quick: Do you think dbt snapshots detect changes by comparing all columns by default? Commit yes or no.
Common Belief:dbt snapshots always compare every column in the source table to detect changes.
Tap to reveal reality
Reality:dbt snapshots compare only specified columns or use a timestamp column, not all columns by default.
Why it matters:Not specifying columns can cause missed changes or unnecessary snapshot rows.
Quick: Can snapshot tables replace all types of data versioning needs? Commit yes or no.
Common Belief:Snapshot tables are a one-size-fits-all solution for all historical data tracking.
Tap to reveal reality
Reality:Snapshots are best for slowly changing dimensions but not ideal for high-frequency change capture or complex versioning like full change data capture.
Why it matters:Misusing snapshots for unsuitable cases can cause performance issues and inaccurate history.
Quick: Do you think snapshot tables require complex ETL pipelines to maintain? Commit yes or no.
Common Belief:Maintaining snapshot tables needs complex, custom ETL jobs to track changes.
Tap to reveal reality
Reality:dbt snapshots automate change detection and history storage using simple SQL and configuration.
Why it matters:Believing snapshots need complex ETL discourages their use and adds unnecessary work.
Expert Zone
1
Snapshot tables rely heavily on the uniqueness and stability of the primary key; changing keys can break history tracking.
2
Choosing the right snapshot strategy ('check' vs 'timestamp') affects both accuracy and performance, especially with large datasets.
3
Metadata columns like 'dbt_valid_from' and 'dbt_valid_to' enable complex temporal queries but require careful handling to avoid confusion.
When NOT to use
Snapshot tables are not suitable for capturing every single data change in real-time or for high-frequency event streams. For those cases, use Change Data Capture (CDC) systems or event sourcing. Also, if your data changes rarely and history is not needed, simple overwrite models are better.
Production Patterns
In production, snapshot tables are often combined with partitioning and incremental models to manage size. Teams use snapshots to implement type 2 slowly changing dimensions in dimensional models. Snapshots also support audit trails and compliance by preserving data history automatically during dbt runs.
Connections
Slowly Changing Dimensions (SCD)
Snapshot tables implement type 2 SCD by storing full history of changes.
Understanding snapshots clarifies how historical attribute changes are tracked in data warehouses.
Change Data Capture (CDC)
Snapshots and CDC both track data changes but CDC captures every event in real-time, while snapshots capture changes at batch intervals.
Knowing the difference helps choose the right tool for data versioning needs.
Version Control Systems (e.g., Git)
Both snapshot tables and version control systems store history of changes over time.
Seeing snapshots as version control for data helps understand their purpose and design.
Common Pitfalls
#1Not specifying columns to track changes causes missed updates.
Wrong approach:snapshots: - name: customer_snapshot strategy: check unique_key: customer_id # forgot to specify 'check_cols'
Correct approach:snapshots: - name: customer_snapshot strategy: check unique_key: customer_id check_cols: ['email', 'status']
Root cause:Assuming dbt snapshots compare all columns by default leads to missing changes.
#2Using snapshot tables for high-frequency event data causes performance issues.
Wrong approach:Creating snapshots on a streaming events table with thousands of changes per minute.
Correct approach:Use a dedicated CDC or event streaming system for high-frequency data; reserve snapshots for slower-changing dimension tables.
Root cause:Misunderstanding snapshot tables' batch nature and storage growth.
#3Ignoring snapshot table growth leads to slow queries and high costs.
Wrong approach:Never archiving or partitioning snapshot tables, letting them grow indefinitely.
Correct approach:Implement partitioning by date and archive old snapshot data regularly.
Root cause:Not planning for data volume growth in historical tables.
Key Takeaways
Snapshot tables store every change in your data, creating a full history over time.
dbt automates snapshot creation by detecting changes and adding new rows with timestamps.
Choosing the right snapshot strategy and columns to track is essential for accuracy and efficiency.
Snapshots support slowly changing dimensions but are not suited for real-time or high-frequency change capture.
Managing snapshot table size with partitioning and archiving is critical for production use.