Overview - Snapshot tables for historical tracking

What is it?

Snapshot tables are tables that keep a history of how your data changes over time. Instead of showing only the latest data, they store a new version of a record each time a change is detected. This lets you see how data looked at different points in the past. In dbt, snapshots automate this: each time a snapshot runs, dbt compares the source with what it last recorded and captures any differences. Changes that happen and are overwritten between two runs are not captured.

Why it matters

Without snapshot tables, you lose the story of how your data evolved. This makes it hard to analyze trends, audit changes, or fix mistakes. Snapshot tables let you answer questions like 'What was the status last month?' or 'When did this value change?'. That record makes your data auditable and usable for historical analysis.

Where it fits

Before learning snapshot tables, you should understand basic SQL and dbt models. After mastering snapshots, you can explore advanced data versioning, slowly changing dimensions, and time travel queries in data warehouses.

Mental Model

Core Idea

Snapshot tables record each change they detect in your data, run after run, building a timeline of historical records.

Think of it like...

Imagine taking a photo of your desk every day. Each photo shows how your desk looked that day, even if you moved things around later. Snapshot tables are like those daily photos for your data.

┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Original Data │──────▶│ Snapshot Run 1│──────▶│ Snapshot Table│
│ (Current)     │       │ (Detects diff)│       │ (Stores state)│
└───────────────┘       └───────────────┘       └───────────────┘

Each run compares the current data to the last snapshot and stores any changes.

Build-Up - 7 Steps

1

Foundation: Understanding data changes over time

Concept: Data changes and why tracking history matters.

Data in databases often changes: prices update, statuses shift, user details get edited. Without saving old versions, you only see the latest state. This loses valuable history needed for analysis, audits, or debugging.

Result

You realize that the current state alone is not enough to answer many real-world questions.

Understanding that data evolves helps you see why keeping history is important.

2

Foundation: What snapshot tables do

3

Intermediate: How dbt snapshot works

4

Intermediate: Types of snapshot strategies

5

Intermediate: Snapshot table schema and metadata

6

Advanced: Handling slowly changing dimensions with snapshots

7

Expert: Performance and storage considerations

Under the Hood

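dbt snapshots run SQL queries that compare current source data to the last snapshot state, matching rows by the unique key and comparing either the tracked columns or an update timestamp. When a difference is found, dbt inserts a new row with the updated data and closes out the previous version by setting its 'dbt_valid_to' timestamp. Metadata columns such as 'dbt_valid_from' and 'dbt_valid_to' mark each version's validity period, and by default the current version is the one whose 'dbt_valid_to' is null. This all happens inside your data warehouse, using SQL for change detection and storage.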

Why designed this way?

Snapshots were designed to automate historical tracking without complex ETL pipelines. Using SQL and metadata columns fits well with data warehouse architectures. The design balances simplicity, auditability, and performance. Alternatives like full change data capture require more infrastructure and complexity, which dbt snapshots avoid.

┌───────────────┐       ┌───────────────┐       ┌───────────────┐
│ Source Table  │──────▶│ dbt Snapshot  │──────▶│ Snapshot Table│
│ (Current data)│       │ (Compare data)│       │(Store history)│
└───────────────┘       └───────────────┘       └───────────────┘
       │                      │                       │
       │                      │                       │
       ▼                      ▼                       ▼
  New data arrives      Detect changes          Insert new rows
  in source table      by key and columns     with timestamps and flags
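
The flow in the diagram can be sketched as a toy model in plain Python. This is not dbt's implementation (dbt does the comparison in SQL inside the warehouse); the function name `snapshot_run`, the sample rows, and the in-memory list are assumptions made for illustration, and deletes are ignored.

```python
from datetime import datetime


def snapshot_run(history, source_rows, key, check_cols, run_at):
    """Apply one simulated snapshot run to `history` (a list of dict rows)."""
    # Index the currently valid version of each record (open validity window).
    current = {row[key]: row for row in history if row["dbt_valid_to"] is None}
    for src in source_rows:
        old = current.get(src[key])
        if old is not None and all(old[c] == src[c] for c in check_cols):
            continue  # no tracked column changed: nothing to record
        if old is not None:
            old["dbt_valid_to"] = run_at  # close out the previous version
        # Insert the new version with an open-ended validity window.
        history.append({**src, "dbt_valid_from": run_at, "dbt_valid_to": None})
    return history


history = []
snapshot_run(history, [{"id": 1, "status": "pending"}], "id", ["status"], datetime(2024, 1, 1))
snapshot_run(history, [{"id": 1, "status": "pending"}], "id", ["status"], datetime(2024, 1, 2))
snapshot_run(history, [{"id": 1, "status": "shipped"}], "id", ["status"], datetime(2024, 1, 3))

for row in history:
    print(row["id"], row["status"], row["dbt_valid_from"], row["dbt_valid_to"])
```

The second run finds no difference and records nothing, so the history ends with two versions: a closed 'pending' row and an open 'shipped' row.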

Myth Busters - 4 Common Misconceptions

Quick: Do snapshot tables automatically delete old versions to save space? Commit yes or no.

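Common Belief: Snapshot tables keep only the latest version of each record to save storage.

Reality: Snapshot tables keep every version they record. Old rows are not deleted automatically, which is why these tables grow over time.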

Quick: Do you think dbt snapshots detect changes by comparing all columns by default? Commit yes or no.

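Common Belief: dbt snapshots always compare every column in the source table to detect changes.

Reality: dbt compares only what the chosen strategy tells it to: the columns listed in 'check_cols' for the 'check' strategy, or an update timestamp column for the 'timestamp' strategy.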

Quick: Can snapshot tables replace all types of data versioning needs? Commit yes or no.

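Common Belief: Snapshot tables are a one-size-fits-all solution for all historical data tracking.

Reality: Snapshots capture changes at batch intervals, so they suit slowly changing data. Real-time or high-frequency changes call for Change Data Capture (CDC) or event sourcing.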

Quick: Do you think snapshot tables require complex ETL pipelines to maintain? Commit yes or no.

Common Belief: Maintaining snapshot tables needs complex, custom ETL jobs to track changes.

Reality: dbt snapshots run as SQL inside your data warehouse and need no separate ETL pipeline.

Expert Zone

1

Snapshot tables rely heavily on the uniqueness and stability of the primary key; changing keys can break history tracking.

2

Choosing the right snapshot strategy ('check' vs 'timestamp') affects both accuracy and performance, especially with large datasets.

3

Metadata columns like 'dbt_valid_from' and 'dbt_valid_to' enable point-in-time queries, but they need care: by default the current version has a null 'dbt_valid_to', so range filters must handle nulls explicitly.
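
As a rough illustration of a temporal query over those metadata columns, the sketch below runs an "as of" lookup against an in-memory SQLite table from Python. The table contents, the `status_as_of` helper, and the use of ISO date strings are assumptions for the example; it also assumes the default convention that the current version has a null 'dbt_valid_to'. In a real project the same WHERE clause would run in your warehouse's SQL dialect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_snapshot ("
    " customer_id INTEGER, status TEXT,"
    " dbt_valid_from TEXT, dbt_valid_to TEXT)"
)
conn.executemany(
    "INSERT INTO customer_snapshot VALUES (?, ?, ?, ?)",
    [
        (1, "trial", "2024-01-01", "2024-02-15"),
        (1, "active", "2024-02-15", None),  # NULL dbt_valid_to marks the current version
    ],
)

# Half-open window: valid_from <= as_of < valid_to, with NULL meaning "still valid".
AS_OF_SQL = """
SELECT status
FROM customer_snapshot
WHERE customer_id = :id
  AND dbt_valid_from <= :as_of
  AND (dbt_valid_to > :as_of OR dbt_valid_to IS NULL)
"""


def status_as_of(customer_id, as_of):
    """Return the status that was valid on `as_of`, or None if no version existed yet."""
    row = conn.execute(AS_OF_SQL, {"id": customer_id, "as_of": as_of}).fetchone()
    return row[0] if row else None


print(status_as_of(1, "2024-01-31"))  # trial
print(status_as_of(1, "2024-03-01"))  # active
print(status_as_of(1, "2023-12-01"))  # None
```

Leaving out the `IS NULL` branch would silently drop every current row from the result, which is the most common way these queries go wrong.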

When NOT to use

Snapshot tables are not suitable for capturing every single data change in real-time or for high-frequency event streams. For those cases, use Change Data Capture (CDC) systems or event sourcing. Also, if your data changes rarely and history is not needed, simple overwrite models are better.

Production Patterns

In production, snapshot tables are often combined with partitioning and incremental models to manage size. Teams use snapshots to implement type 2 slowly changing dimensions in dimensional models. Snapshots also support audit trails and compliance by preserving data history each time 'dbt snapshot' or 'dbt build' executes; 'dbt run' on its own does not run snapshots.

Connections

Slowly Changing Dimensions (SCD)

Snapshot tables implement type 2 SCD by storing each recorded version of a row together with its validity period.

Understanding snapshots clarifies how historical attribute changes are tracked in data warehouses.

Change Data Capture (CDC)

Snapshots and CDC both track data changes, but CDC captures every event as it happens, while snapshots capture only the state seen at each batch run.

Knowing the difference helps choose the right tool for data versioning needs.
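
The difference can be made concrete with a small Python simulation. The event list, the integer timestamps, and the run schedule are invented for illustration: a CDC feed would deliver every event, while a snapshot only sees whatever state exists when it runs.

```python
from bisect import bisect_right

# (time, status) change events, as a CDC feed would deliver them.
events = [(1, "pending"), (2, "paid"), (3, "packed"), (9, "shipped")]
run_times = [0, 5, 10]  # hypothetical snapshot schedule


def state_at(t):
    """Return the status in effect at time t, or None before the first event."""
    times = [time for time, _ in events]
    i = bisect_right(times, t)
    return events[i - 1][1] if i else None


seen_by_snapshots = []
for t in run_times:
    state = state_at(t)
    # A snapshot records a new version only when the state differs from the last one it saw.
    if state is not None and (not seen_by_snapshots or seen_by_snapshots[-1] != state):
        seen_by_snapshots.append(state)

print([status for _, status in events])  # ['pending', 'paid', 'packed', 'shipped']
print(seen_by_snapshots)                 # ['packed', 'shipped']
```

The 'pending' and 'paid' states were overwritten before the run at time 5, so they never reach the snapshot history.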

Version Control Systems (e.g., Git)

Both snapshot tables and version control systems store history of changes over time.

Seeing snapshots as version control for data helps understand their purpose and design.

Common Pitfalls

#1 Not specifying which columns to track causes missed updates.

Wrong approach:
snapshots:
  - name: customer_snapshot
    strategy: check
    unique_key: customer_id
    # forgot to specify 'check_cols'

Correct approach:
snapshots:
  - name: customer_snapshot
    strategy: check
    unique_key: customer_id
    check_cols: ['email', 'status']

Root cause: Assuming dbt decides on its own which columns to compare. The 'check' strategy compares only the columns named in 'check_cols' (a list of columns, or 'all'), so a change in any column left off the list is not recorded.
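
A minimal Python sketch of why the tracked column list matters. The helper `has_tracked_change` and the 'phone' column are hypothetical; the comparison only mimics the idea behind the 'check' strategy, not dbt's actual SQL.

```python
def has_tracked_change(old_row, new_row, check_cols):
    """Return True if any column listed in check_cols differs between the two versions."""
    return any(old_row[col] != new_row[col] for col in check_cols)


old = {"customer_id": 7, "email": "a@example.com", "status": "active", "phone": "555-0100"}
new = {"customer_id": 7, "email": "a@example.com", "status": "active", "phone": "555-0199"}

# 'phone' changed, but it is not tracked, so no new version would be recorded.
print(has_tracked_change(old, new, ["email", "status"]))           # False
# Adding 'phone' to the tracked columns makes the change visible.
print(has_tracked_change(old, new, ["email", "status", "phone"]))  # True
```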

#2 Using snapshot tables for high-frequency event data causes performance issues.

Wrong approach: Creating snapshots on a streaming events table with thousands of changes per minute.

Correct approach: Use a dedicated CDC or event streaming system for high-frequency data; reserve snapshots for slower-changing dimension tables.

Root cause: Misunderstanding snapshot tables' batch nature and storage growth.

#3 Ignoring snapshot table growth leads to slow queries and high costs.

Wrong approach: Never archiving or partitioning snapshot tables, letting them grow indefinitely.

Correct approach: Implement partitioning by date and archive old snapshot data regularly.

Root cause: Not planning for data volume growth in historical tables.
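
One possible shape for the archiving step, sketched in Python against an in-memory SQLite database. The archive table name 'customer_snapshot_archive', the cutoff date, and the `archive_closed_before` helper are assumptions; whether moving rows out of a dbt-managed snapshot table is appropriate depends on your setup, and in practice this would be written in your warehouse's own SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cols = "(customer_id INTEGER, status TEXT, dbt_valid_from TEXT, dbt_valid_to TEXT)"
conn.execute("CREATE TABLE customer_snapshot " + cols)
conn.execute("CREATE TABLE customer_snapshot_archive " + cols)
conn.executemany(
    "INSERT INTO customer_snapshot VALUES (?, ?, ?, ?)",
    [
        (1, "trial", "2021-03-01", "2021-06-01"),   # closed long ago
        (1, "active", "2021-06-01", "2024-02-01"),  # closed recently
        (1, "churned", "2024-02-01", None),         # current version, never archived
    ],
)


def archive_closed_before(cutoff):
    """Move versions whose validity ended before `cutoff` into the archive table."""
    with conn:  # commit both statements together, or roll back on error
        conn.execute(
            "INSERT INTO customer_snapshot_archive "
            "SELECT * FROM customer_snapshot WHERE dbt_valid_to < ?",
            (cutoff,),
        )
        moved = conn.execute(
            "DELETE FROM customer_snapshot WHERE dbt_valid_to < ?", (cutoff,)
        ).rowcount
    return moved


print(archive_closed_before("2023-01-01"))  # 1
```

Because a null 'dbt_valid_to' never satisfies the `<` comparison, current versions stay in the live table; only closed-out history is moved.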

Key Takeaways

Snapshot tables store each change detected at run time, building a history of your data over time.

dbt automates snapshot creation by detecting changes and adding new rows with timestamps.

Choosing the right snapshot strategy and columns to track is essential for accuracy and efficiency.

Snapshots support slowly changing dimensions but are not suited for real-time or high-frequency change capture.

Managing snapshot table size with partitioning and archiving is critical for production use.