0
0
dbtdata~15 mins

Source freshness checks in dbt - Deep Dive

Choose your learning style9 modes available
Overview - Source freshness checks
What is it?
Source freshness checks are a way to monitor how up-to-date your data sources are. They help you know if the data you use in your analysis or reports is fresh or outdated. This is done by checking timestamps or update times on the original data before it enters your data models. It ensures you trust the data's timeliness.
Why it matters
Without source freshness checks, you might make decisions based on old or stale data without realizing it. This can lead to wrong conclusions or missed opportunities. Freshness checks give you confidence that your data reflects the latest information, which is crucial for accurate analysis and business decisions.
Where it fits
Before learning source freshness checks, you should understand basic dbt concepts like models, sources, and tests. After mastering freshness checks, you can explore advanced data quality monitoring and alerting systems to automate data reliability.
Mental Model
Core Idea
Source freshness checks measure how recent the data in your source tables is to ensure your analysis uses up-to-date information.
Think of it like...
It's like checking the expiration date on food before eating it to make sure it's still good and safe.
┌───────────────────────────────┐
│        Source Table           │
│  ┌─────────────────────────┐  │
│  │ Timestamp of last update│  │
│  └─────────────────────────┘  │
│               │               │
│               ▼               │
│  Freshness Check compares     │
│  current time with timestamp  │
│               │               │
│               ▼               │
│  Result: Fresh or Stale       │
└───────────────────────────────┘
Build-Up - 6 Steps
1
FoundationUnderstanding data freshness basics
🤔
Concept: Learn what data freshness means and why it matters in data analysis.
Data freshness refers to how recently data was updated or loaded. Fresh data means it reflects the latest changes, while stale data is old and may not be reliable. For example, a sales report using yesterday's data might miss today's sales. Freshness is often tracked using timestamps that record when data was last updated.
Result
You understand that freshness is about data recency and its impact on trustworthiness.
Knowing what freshness means helps you realize why checking data update times is important before analysis.
2
FoundationIntroduction to dbt sources and timestamps
🤔
Concept: Learn how dbt defines sources and how timestamps are used to track data updates.
In dbt, sources represent raw tables from your database. Each source can have metadata, including a column that stores the last update time, like 'updated_at'. This timestamp is key to checking freshness. You configure sources in dbt's YAML files and specify which column holds the update time.
Result
You can identify and configure source tables with update timestamps in dbt.
Understanding sources and timestamps in dbt sets the stage for automating freshness checks.
3
IntermediateConfiguring freshness checks in dbt
🤔Before reading on: Do you think freshness checks run automatically or need explicit setup in dbt? Commit to your answer.
Concept: Learn how to set up freshness checks in dbt by specifying freshness criteria in source configurations.
In dbt, you add a 'freshness' block under your source in the YAML file. Here, you define how recent the data should be, like 'warn_after' 1 hour and 'error_after' 2 hours. When you run 'dbt source freshness', dbt compares the current time to the latest timestamp in the source and reports if data is fresh, warning, or error state.
Result
You can configure and run freshness checks that tell you if your source data is fresh or stale.
Knowing that freshness checks require explicit configuration helps you control data quality actively.
4
IntermediateInterpreting freshness check results
🤔Before reading on: Do you think a 'warn' status means data is unusable or just a caution? Commit to your answer.
Concept: Learn how to read and act on the results of freshness checks in dbt.
When you run freshness checks, dbt outputs statuses: 'pass' means data is fresh, 'warn' means data is older than the warning threshold but still usable, and 'error' means data is too old and likely unreliable. You can use these statuses to decide if you should pause reports or investigate data delays.
Result
You understand how to interpret freshness statuses and their impact on data trust.
Recognizing the difference between warnings and errors helps prioritize data quality actions.
5
AdvancedAutomating freshness checks in CI/CD pipelines
🤔Before reading on: Do you think freshness checks can run automatically in deployment pipelines? Commit to your answer.
Concept: Learn how to integrate freshness checks into automated workflows for continuous data quality monitoring.
You can add 'dbt source freshness' commands to your CI/CD pipelines or scheduled jobs. This way, freshness checks run regularly without manual effort. If freshness fails, the pipeline can alert your team or stop downstream jobs, preventing stale data from reaching reports or dashboards.
Result
Your data freshness is monitored continuously, reducing risk of stale data use.
Automating freshness checks ensures timely detection of data issues and supports reliable data delivery.
6
ExpertHandling complex freshness scenarios and limitations
🤔Before reading on: Do you think freshness checks always guarantee data quality? Commit to your answer.
Concept: Explore challenges like irregular update patterns, timezone issues, and limitations of freshness checks.
Freshness checks rely on timestamps, but some sources update irregularly or lack reliable timestamps. Timezone differences can cause false alerts if not handled properly. Also, freshness does not check data correctness, only recency. Experts combine freshness with other tests and monitoring to ensure full data quality.
Result
You know when freshness checks might give false signals and how to mitigate them.
Understanding freshness limitations prevents overreliance and encourages comprehensive data quality strategies.
Under the Hood
dbt runs a query on the source table to find the maximum value of the timestamp column specified for freshness. It then compares this timestamp to the current system time. Based on configured thresholds, dbt categorizes the freshness status. This process happens at runtime when you execute 'dbt source freshness'.
Why designed this way?
This design uses existing data timestamps to avoid extra overhead or complex tracking systems. It leverages the database's native capabilities for efficient checks. Alternatives like event-based freshness tracking are more complex and less portable across systems.
┌───────────────┐      ┌─────────────────────┐      ┌─────────────┐
│ Source Table  │─────▶│ Query max(timestamp) │─────▶│ Compare to  │
│ with updated  │      │                     │      │ current time│
│ timestamp col │      └─────────────────────┘      └─────────────┘
│               │                                      │
└───────────────┘                                      ▼
                                                  ┌─────────────┐
                                                  │ Freshness   │
                                                  │ Status:     │
                                                  │ pass/warn/  │
                                                  │ error       │
                                                  └─────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Does a passing freshness check guarantee the data is accurate? Commit yes or no.
Common Belief:If freshness checks pass, the data is always correct and reliable.
Tap to reveal reality
Reality:Freshness checks only confirm data is recent, not that it is accurate or complete.
Why it matters:Relying solely on freshness can let incorrect or incomplete data slip through, causing wrong decisions.
Quick: Do freshness checks run automatically without setup in dbt? Commit yes or no.
Common Belief:dbt automatically checks freshness for all sources without extra configuration.
Tap to reveal reality
Reality:You must explicitly configure freshness checks in dbt source YAML files; they do not run by default.
Why it matters:Assuming automatic checks can lead to missing stale data issues because no checks were actually set up.
Quick: Can freshness checks detect data errors like wrong values? Commit yes or no.
Common Belief:Freshness checks will catch any data errors or inconsistencies.
Tap to reveal reality
Reality:Freshness checks only measure update recency, not data correctness or integrity.
Why it matters:Mistaking freshness for full data quality can cause overlooked errors that impact analysis.
Quick: Does timezone difference never affect freshness check results? Commit yes or no.
Common Belief:Timezone differences do not impact freshness check accuracy.
Tap to reveal reality
Reality:If timestamps and system times are in different timezones without adjustment, freshness checks can give false warnings.
Why it matters:Ignoring timezone issues can cause unnecessary alerts or missed stale data detection.
Expert Zone
1
Freshness thresholds should reflect business context; some data can tolerate longer delays without harm.
2
Combining freshness checks with other tests like uniqueness or null checks provides a fuller data quality picture.
3
Handling timezone-aware timestamps correctly avoids false positives in freshness alerts, especially in global systems.
When NOT to use
Do not rely on freshness checks alone when data correctness or completeness is critical; use additional data quality tests and validation tools. For event-driven or streaming data, consider real-time monitoring systems instead.
Production Patterns
In production, teams integrate freshness checks into nightly batch jobs and alerting systems. They often combine freshness with SLA monitoring and use dashboards to track data health over time.
Connections
Data Quality Testing
Builds-on
Understanding freshness checks helps grasp broader data quality testing, which includes accuracy, completeness, and consistency.
Continuous Integration/Continuous Deployment (CI/CD)
Same pattern
Automating freshness checks in CI/CD pipelines applies the same principles of automated testing and deployment used in software engineering.
Supply Chain Inventory Management
Analogous process
Just like freshness checks ensure data is up-to-date, inventory management tracks stock freshness to avoid selling expired goods, showing a cross-domain pattern of freshness monitoring.
Common Pitfalls
#1Not configuring freshness checks in dbt source files.
Wrong approach:sources: - name: sales_data tables: - name: daily_sales # Missing freshness block here
Correct approach:sources: - name: sales_data tables: - name: daily_sales freshness: warn_after: {count: 1, period: hour} error_after: {count: 2, period: hour} loaded_at_field: updated_at
Root cause:Assuming freshness checks run automatically without explicit configuration.
#2Using incorrect timestamp column for freshness check.
Wrong approach:freshness: loaded_at_field: created_date # This column does not update on data refresh
Correct approach:freshness: loaded_at_field: updated_at # Correct column that reflects last update time
Root cause:Confusing creation time with last update time, leading to stale data not detected.
#3Ignoring timezone differences in timestamp comparisons.
Wrong approach:# Timestamps stored in UTC but system time local freshness: loaded_at_field: updated_at warn_after: {count: 1, period: hour} error_after: {count: 2, period: hour}
Correct approach:# Ensure timestamps and system time use same timezone or convert # Handle timezone in source or dbt config explicitly
Root cause:Not accounting for timezone mismatch causes false freshness alerts.
Key Takeaways
Source freshness checks verify how recent your data is to ensure timely and reliable analysis.
In dbt, freshness checks require explicit configuration specifying which timestamp column to use and freshness thresholds.
Freshness checks only measure data recency, not correctness or completeness, so they should be combined with other data quality tests.
Automating freshness checks in pipelines helps catch stale data early and maintain trust in your data systems.
Understanding limitations like timezone issues and irregular updates prevents false alerts and improves monitoring accuracy.