Bird
Raised Fist0
dbtdata~15 mins

Source freshness checks in dbt - Deep Dive

Choose your learning style10 modes available

Start learning this pattern below

Jump into concepts and practice - no test required

or
Recommended
Test this pattern10 questions across easy, medium, and hard to know if this pattern is strong
Overview - Source freshness checks
What is it?
Source freshness checks are a way to monitor how up-to-date your data sources are. They help you know if the data you use in your analysis or reports is fresh or outdated. This is done by checking timestamps or update times on the original data before it enters your data models. It ensures you trust the data's timeliness.
Why it matters
Without source freshness checks, you might make decisions based on old or stale data without realizing it. This can lead to wrong conclusions or missed opportunities. Freshness checks give you confidence that your data reflects the latest information, which is crucial for accurate analysis and business decisions.
Where it fits
Before learning source freshness checks, you should understand basic dbt concepts like models, sources, and tests. After mastering freshness checks, you can explore advanced data quality monitoring and alerting systems to automate data reliability.
Mental Model
Core Idea
Source freshness checks measure how recent the data in your source tables is to ensure your analysis uses up-to-date information.
Think of it like...
It's like checking the expiration date on food before eating it to make sure it's still good and safe.
┌───────────────────────────────┐
│        Source Table           │
│  ┌─────────────────────────┐  │
│  │ Timestamp of last update│  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  Freshness Check compares     │
│  current time with timestamp  │
│               │               │
│               ▼               │
│  Result: Fresh or Stale       │
└───────────────────────────────┘
Build-Up - 6 Steps
1
FoundationUnderstanding data freshness basics
🤔
Concept: Learn what data freshness means and why it matters in data analysis.
Data freshness refers to how recently data was updated or loaded. Fresh data means it reflects the latest changes, while stale data is old and may not be reliable. For example, a sales report using yesterday's data might miss today's sales. Freshness is often tracked using timestamps that record when data was last updated.
Result
You understand that freshness is about data recency and its impact on trustworthiness.
Knowing what freshness means helps you realize why checking data update times is important before analysis.
2
FoundationIntroduction to dbt sources and timestamps
🤔
Concept: Learn how dbt defines sources and how timestamps are used to track data updates.
In dbt, sources represent raw tables from your database. Each source can have metadata, including a column that stores the last update time, like 'updated_at'. This timestamp is key to checking freshness. You configure sources in dbt's YAML files and specify which column holds the update time.
Result
You can identify and configure source tables with update timestamps in dbt.
Understanding sources and timestamps in dbt sets the stage for automating freshness checks.
3
IntermediateConfiguring freshness checks in dbt
🤔Before reading on: Do you think freshness checks run automatically or need explicit setup in dbt? Commit to your answer.
Concept: Learn how to set up freshness checks in dbt by specifying freshness criteria in source configurations.
In dbt, you add a 'freshness' block under your source in the YAML file. Here, you define how recent the data should be, like 'warn_after' 1 hour and 'error_after' 2 hours. When you run 'dbt source freshness', dbt compares the current time to the latest timestamp in the source and reports if data is fresh, warning, or error state.
Result
You can configure and run freshness checks that tell you if your source data is fresh or stale.
Knowing that freshness checks require explicit configuration helps you control data quality actively.
4
IntermediateInterpreting freshness check results
🤔Before reading on: Do you think a 'warn' status means data is unusable or just a caution? Commit to your answer.
Concept: Learn how to read and act on the results of freshness checks in dbt.
When you run freshness checks, dbt outputs statuses: 'pass' means data is fresh, 'warn' means data is older than the warning threshold but still usable, and 'error' means data is too old and likely unreliable. You can use these statuses to decide if you should pause reports or investigate data delays.
Result
You understand how to interpret freshness statuses and their impact on data trust.
Recognizing the difference between warnings and errors helps prioritize data quality actions.
5
AdvancedAutomating freshness checks in CI/CD pipelines
🤔Before reading on: Do you think freshness checks can run automatically in deployment pipelines? Commit to your answer.
Concept: Learn how to integrate freshness checks into automated workflows for continuous data quality monitoring.
You can add 'dbt source freshness' commands to your CI/CD pipelines or scheduled jobs. This way, freshness checks run regularly without manual effort. If freshness fails, the pipeline can alert your team or stop downstream jobs, preventing stale data from reaching reports or dashboards.
Result
Your data freshness is monitored continuously, reducing risk of stale data use.
Automating freshness checks ensures timely detection of data issues and supports reliable data delivery.
6
ExpertHandling complex freshness scenarios and limitations
🤔Before reading on: Do you think freshness checks always guarantee data quality? Commit to your answer.
Concept: Explore challenges like irregular update patterns, timezone issues, and limitations of freshness checks.
Freshness checks rely on timestamps, but some sources update irregularly or lack reliable timestamps. Timezone differences can cause false alerts if not handled properly. Also, freshness does not check data correctness, only recency. Experts combine freshness with other tests and monitoring to ensure full data quality.
Result
You know when freshness checks might give false signals and how to mitigate them.
Understanding freshness limitations prevents overreliance and encourages comprehensive data quality strategies.
Under the Hood
dbt runs a query on the source table to find the maximum value of the timestamp column specified for freshness. It then compares this timestamp to the current system time. Based on configured thresholds, dbt categorizes the freshness status. This process happens at runtime when you execute 'dbt source freshness'.
Why designed this way?
This design uses existing data timestamps to avoid extra overhead or complex tracking systems. It leverages the database's native capabilities for efficient checks. Alternatives like event-based freshness tracking are more complex and less portable across systems.
┌───────────────┐      ┌─────────────────────┐      ┌─────────────┐
│ Source Table  │─────▶│ Query max(timestamp) │─────▶│ Compare to  │
│ with updated  │      │                     │      │ current time│
│ timestamp col │      └─────────────────────┘      └─────────────┘
│               │                                      │
└───────────────┘                                      ▼
                                                  ┌─────────────┐
                                                  │ Freshness   │
                                                  │ Status:     │
                                                  │ pass/warn/  │
                                                  │ error       │
                                                  └─────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does a passing freshness check guarantee the data is accurate? Commit yes or no.
Common Belief:If freshness checks pass, the data is always correct and reliable.
Tap to reveal reality
Reality:Freshness checks only confirm data is recent, not that it is accurate or complete.
Why it matters:Relying solely on freshness can let incorrect or incomplete data slip through, causing wrong decisions.
Quick: Do freshness checks run automatically without setup in dbt? Commit yes or no.
Common Belief:dbt automatically checks freshness for all sources without extra configuration.
Tap to reveal reality
Reality:You must explicitly configure freshness checks in dbt source YAML files; they do not run by default.
Why it matters:Assuming automatic checks can lead to missing stale data issues because no checks were actually set up.
Quick: Can freshness checks detect data errors like wrong values? Commit yes or no.
Common Belief:Freshness checks will catch any data errors or inconsistencies.
Tap to reveal reality
Reality:Freshness checks only measure update recency, not data correctness or integrity.
Why it matters:Mistaking freshness for full data quality can cause overlooked errors that impact analysis.
Quick: Does timezone difference never affect freshness check results? Commit yes or no.
Common Belief:Timezone differences do not impact freshness check accuracy.
Tap to reveal reality
Reality:If timestamps and system times are in different timezones without adjustment, freshness checks can give false warnings.
Why it matters:Ignoring timezone issues can cause unnecessary alerts or missed stale data detection.
Expert Zone
1
Freshness thresholds should reflect business context; some data can tolerate longer delays without harm.
2
Combining freshness checks with other tests like uniqueness or null checks provides a fuller data quality picture.
3
Handling timezone-aware timestamps correctly avoids false positives in freshness alerts, especially in global systems.
When NOT to use
Do not rely on freshness checks alone when data correctness or completeness is critical; use additional data quality tests and validation tools. For event-driven or streaming data, consider real-time monitoring systems instead.
Production Patterns
In production, teams integrate freshness checks into nightly batch jobs and alerting systems. They often combine freshness with SLA monitoring and use dashboards to track data health over time.
Connections
Data Quality Testing
Builds-on
Understanding freshness checks helps grasp broader data quality testing, which includes accuracy, completeness, and consistency.
Continuous Integration/Continuous Deployment (CI/CD)
Same pattern
Automating freshness checks in CI/CD pipelines applies the same principles of automated testing and deployment used in software engineering.
Supply Chain Inventory Management
Analogous process
Just like freshness checks ensure data is up-to-date, inventory management tracks stock freshness to avoid selling expired goods, showing a cross-domain pattern of freshness monitoring.
Common Pitfalls
#1Not configuring freshness checks in dbt source files.
Wrong approach:sources: - name: sales_data tables: - name: daily_sales # Missing freshness block here
Correct approach:sources: - name: sales_data tables: - name: daily_sales freshness: warn_after: {count: 1, period: hour} error_after: {count: 2, period: hour} loaded_at_field: updated_at
Root cause:Assuming freshness checks run automatically without explicit configuration.
#2Using incorrect timestamp column for freshness check.
Wrong approach:freshness: loaded_at_field: created_date # This column does not update on data refresh
Correct approach:freshness: loaded_at_field: updated_at # Correct column that reflects last update time
Root cause:Confusing creation time with last update time, leading to stale data not detected.
#3Ignoring timezone differences in timestamp comparisons.
Wrong approach:# Timestamps stored in UTC but system time local freshness: loaded_at_field: updated_at warn_after: {count: 1, period: hour} error_after: {count: 2, period: hour}
Correct approach:# Ensure timestamps and system time use same timezone or convert # Handle timezone in source or dbt config explicitly
Root cause:Not accounting for timezone mismatch causes false freshness alerts.
Key Takeaways
Source freshness checks verify how recent your data is to ensure timely and reliable analysis.
In dbt, freshness checks require explicit configuration specifying which timestamp column to use and freshness thresholds.
Freshness checks only measure data recency, not correctness or completeness, so they should be combined with other data quality tests.
Automating freshness checks in pipelines helps catch stale data early and maintain trust in your data systems.
Understanding limitations like timezone issues and irregular updates prevents false alerts and improves monitoring accuracy.

Practice

(1/5)
1. What is the main purpose of source freshness checks in dbt?
easy
A. To track how recent the data in your source tables is
B. To create new tables from raw data
C. To optimize SQL query performance
D. To schedule dbt runs automatically

Solution

  1. Step 1: Understand the role of freshness checks

    Freshness checks monitor the age of data in source tables to ensure it is up-to-date.
  2. Step 2: Compare options to the purpose

    Only To track how recent the data in your source tables is describes tracking data recency, which matches the purpose of freshness checks.
  3. Final Answer:

    To track how recent the data in your source tables is -> Option A
  4. Quick Check:

    Freshness checks = track data recency [OK]
Hint: Freshness checks measure data age, not table creation or scheduling [OK]
Common Mistakes:
  • Confusing freshness checks with table creation
  • Thinking freshness checks optimize queries
  • Assuming freshness checks schedule runs
2. Which of the following is the correct way to set a freshness check with a warning threshold of 1 day and an error threshold of 2 days in dbt YAML?
easy
A. freshness: warn_after: 1 day error_after: 2 day
B. freshness: warn_after: {count: 1, period: day} error_after: {count: 2, period: day}
C. freshness: warn_after: '1 day' error_after: '2 days'
D. freshness: warn_after: {count: 2, period: day} error_after: {count: 1, period: day}

Solution

  1. Step 1: Recall correct YAML syntax for freshness

    dbt expects warn_after and error_after as objects with count and period keys.
  2. Step 2: Match options to syntax

    freshness: warn_after: {count: 1, period: day} error_after: {count: 2, period: day} correctly uses {count: X, period: day} format; others use incorrect formats or swap thresholds.
  3. Final Answer:

    freshness: warn_after: {count: 1, period: day} error_after: {count: 2, period: day} -> Option B
  4. Quick Check:

    Use count and period keys in YAML freshness [OK]
Hint: Use {count: X, period: day} format for freshness thresholds [OK]
Common Mistakes:
  • Using strings instead of objects for thresholds
  • Swapping warn_after and error_after values
  • Missing count or period keys
3. Given this freshness check result output, what is the status if the last loaded timestamp is 3 days ago, warn_after is 1 day, and error_after is 2 days?
{"status": "", "max_loaded_at": "2024-04-20T00:00:00Z"}
medium
A. error
B. warn
C. pass
D. unknown

Solution

  1. Step 1: Calculate data age from last loaded timestamp

    If today is 2024-04-23, data is 3 days old (2024-04-23 - 2024-04-20).
  2. Step 2: Compare data age to thresholds

    3 days > error_after (2 days), so status is error.
  3. Final Answer:

    error -> Option A
  4. Quick Check:

    Data age > error_after = error status [OK]
Hint: If data age > error_after, status is error [OK]
Common Mistakes:
  • Confusing warn_after and error_after thresholds
  • Assuming status is warn for data older than error_after
  • Ignoring current date when calculating age
4. You wrote this freshness check YAML but it fails to run:
sources:
  - name: my_source
    freshness:
      warn_after: {count: 1, period: day}
      error_after: {count: 2, period: days}
What is the likely cause of the error?
medium
A. The count values must be strings, not numbers
B. Missing quotes around the period values
C. warn_after and error_after keys are swapped
D. The period value 'days' should be singular 'day'

Solution

  1. Step 1: Check period values in freshness YAML

    dbt expects period values as singular strings like 'day', not plural 'days'.
  2. Step 2: Identify error cause

    Using 'days' causes a validation error; changing to 'day' fixes it.
  3. Final Answer:

    The period value 'days' should be singular 'day' -> Option D
  4. Quick Check:

    Period values must be singular like 'day' [OK]
Hint: Use singular period names like 'day', not 'days' [OK]
Common Mistakes:
  • Using plural period names
  • Swapping warn_after and error_after
  • Adding unnecessary quotes around numbers
5. You want to set up a freshness check for a source table that updates hourly. You want to warn if data is older than 2 hours and error if older than 4 hours. Which YAML snippet correctly sets this up?
hard
A. freshness: warn_after: {count: '2', period: hour} error_after: {count: '4', period: hour}
B. freshness: warn_after: {count: 2, period: hours} error_after: {count: 4, period: hours}
C. freshness: warn_after: {count: 2, period: hour} error_after: {count: 4, period: hour}
D. freshness: warn_after: {count: 4, period: hour} error_after: {count: 2, period: hour}

Solution

  1. Step 1: Identify correct period and count values

    Period should be singular 'hour', counts are numbers without quotes.
  2. Step 2: Check warn_after and error_after order

    warn_after must be less than error_after; 2 < 4 is correct.
  3. Step 3: Validate options

    freshness: warn_after: {count: 2, period: hour} error_after: {count: 4, period: hour} matches correct syntax and logic; A uses strings for counts, B uses plural 'hours', D swaps thresholds.
  4. Final Answer:

    freshness: warn_after: {count: 2, period: hour} error_after: {count: 4, period: hour} -> Option C
  5. Quick Check:

    Use singular period and correct threshold order [OK]
Hint: Use singular period and warn_after < error_after [OK]
Common Mistakes:
  • Using plural period names like 'hours'
  • Putting counts as strings instead of numbers
  • Swapping warn_after and error_after values