
Semi-structured data handling (JSON) in dbt - Deep Dive

Overview - Semi-structured data handling (JSON)
What is it?
Semi-structured data is a type of data that does not fit neatly into tables but still has some organization, like JSON. JSON (JavaScript Object Notation) stores data in key-value pairs and nested structures, making it flexible. Handling JSON means extracting and transforming this data so it can be analyzed or stored in databases. dbt helps transform JSON data inside your data warehouse using SQL and built-in functions.
Why it matters
Many modern data sources like APIs, logs, and event streams produce JSON data. Without tools to handle JSON, this data would be hard to analyze or combine with traditional tables. If we ignored JSON, we would miss insights hidden in flexible data formats and waste valuable information. Handling JSON well lets businesses unlock rich, detailed data for smarter decisions.
Where it fits
Before learning this, you should understand basic SQL and relational databases. After mastering JSON handling, you can learn advanced data modeling, performance tuning, and integrating APIs with dbt. This topic bridges raw data ingestion and clean, usable analytics-ready tables.
Mental Model
Core Idea
Semi-structured JSON data is like a flexible container of nested information that needs to be unpacked and flattened to fit into structured tables for analysis.
Think of it like...
Imagine a toolbox with many compartments inside, each holding different tools and parts. To use the tools effectively, you need to open the compartments and organize the parts on a workbench. JSON is the toolbox, and handling it means unpacking and arranging the parts for easy use.
JSON Data Structure
┌─────────────┐
│ Root Object │
├─────────────┤
│ Key: Value  │
│ Key: Object │───┐
│ Key: Array  │───┼─ Nested structures
└─────────────┘   │
                  ▼
           ┌─────────────┐
           │ Nested Keys │
           │ Nested Array│
           └─────────────┘
Build-Up - 7 Steps
1
Foundation: Understanding JSON Basics
🤔
Concept: Learn what JSON is and how its structure works with keys, values, objects, and arrays.
JSON stores data as text with pairs like "key": value. Values can be simple (strings, numbers) or complex (objects, arrays). Objects are like dictionaries with keys and values. Arrays are ordered lists of values. This flexible format allows nesting data inside data.
Result
You can read and recognize JSON data structures and understand their components.
Understanding JSON's flexible structure is essential because it explains why JSON data needs special handling compared to flat tables.
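The structures described above can be seen directly with a few lines of Python's standard-library json module — a minimal sketch; the record and its field names are made up for illustration:

```python
import json

# A small JSON document with simple values, a nested object, and an array,
# mirroring the structures described above (names are illustrative).
raw = '''
{
  "id": 7,
  "name": "Ada",
  "address": {"city": "London", "zip": "EC1"},
  "tags": ["admin", "beta"]
}
'''

record = json.loads(raw)          # parse the text into nested Python objects
print(record["name"])             # simple value
print(record["address"]["city"])  # value inside a nested object
print(record["tags"][0])          # first element of an array
```

The same three access patterns — simple key, nested key, array index — are exactly what the SQL JSON functions in the next steps reproduce inside a warehouse.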
2
Foundation: Basics of dbt and SQL for JSON
🤔
Concept: Learn how dbt uses SQL to transform data and the basics of querying JSON inside SQL.
dbt runs SQL queries inside your data warehouse. Many warehouses support JSON functions like extracting keys or flattening arrays. You write SQL SELECT statements that use these functions to access JSON fields. dbt models organize these queries into reusable transformations.
Result
You can write simple SQL queries in dbt that select and extract JSON fields.
Knowing how dbt and SQL interact with JSON lets you start turning flexible data into structured tables step-by-step.
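As a stand-in for a warehouse, the sketch below runs the kind of SELECT a dbt model would contain against SQLite's JSON functions, via Python's stdlib sqlite3 (assuming a Python build whose SQLite includes the JSON1 extension). Warehouse syntax differs, and the table and column names are illustrative:

```python
import sqlite3

# In-memory database standing in for a warehouse table with a raw JSON column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (id INTEGER, payload TEXT)")
conn.execute("""INSERT INTO raw_events VALUES
    (1, '{"user": "ada", "score": 42}'),
    (2, '{"user": "bob", "score": 17}')""")

# The body of a simple dbt model: extract JSON fields as ordinary columns.
rows = conn.execute("""
    SELECT id,
           json_extract(payload, '$.user')  AS user_name,
           json_extract(payload, '$.score') AS score
    FROM raw_events
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'ada', 42), (2, 'bob', 17)]
```

In dbt, a query like this would live in a model file, with `raw_events` replaced by a `source()` or `ref()` call.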
3
Intermediate: Extracting Nested JSON Fields
🤔 Before reading on: do you think you can directly select nested JSON keys as columns in SQL? Commit to your answer.
Concept: Learn how to use JSON extraction functions to access nested keys inside JSON objects.
Most SQL warehouses provide functions like JSON_EXTRACT or -> operators to get nested values. For example, to get a user's city from a JSON address object, you write something like json_column->'address'->>'city'. This extracts the nested value as a string or JSON type.
Result
You can write SQL that pulls out nested JSON values as separate columns.
Understanding how to navigate nested JSON is key to flattening complex data into usable columns.
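The nested access described above can be sketched with SQLite's json_extract and a JSON path, again via Python's stdlib sqlite3 as a warehouse stand-in (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, profile TEXT)")
conn.execute("""INSERT INTO users VALUES
    (1, '{"name": "Ada", "address": {"city": "London", "zip": "EC1"}}')""")

# Navigate into the nested "address" object with a JSON path.
# Warehouse equivalents: profile->'address'->>'city' (Postgres-style operators)
# or JSON_EXTRACT_SCALAR(profile, '$.address.city') (BigQuery).
city = conn.execute("""
    SELECT json_extract(profile, '$.address.city') AS city
    FROM users
""").fetchone()[0]
print(city)  # London
```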
4
Intermediate: Flattening JSON Arrays with dbt
🤔 Before reading on: do you think JSON arrays can be directly expanded into multiple rows in SQL? Commit to your answer.
Concept: Learn how to convert JSON arrays into multiple rows using SQL functions like UNNEST or LATERAL FLATTEN.
JSON arrays hold lists of items. To analyze each item separately, you need to expand the array into rows. SQL functions like UNNEST (BigQuery, PostgreSQL) or LATERAL FLATTEN (Snowflake) let you do this. In dbt, you write models that join or select from these expanded rows.
Result
You can transform JSON arrays into multiple rows for detailed analysis.
Flattening arrays unlocks the ability to analyze repeated or list data inside JSON, which is common in event logs or user actions.
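SQLite's json_each table-valued function plays the role of UNNEST or LATERAL FLATTEN, so the row expansion can be sketched locally through Python's stdlib sqlite3 (table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, payload TEXT)")
conn.execute("""INSERT INTO orders VALUES
    (1, '{"items": ["pen", "ink"]}'),
    (2, '{"items": ["paper"]}')""")

# json_each expands each array element into its own row -- the SQLite
# analogue of UNNEST or LATERAL FLATTEN.
rows = conn.execute("""
    SELECT o.id, j.value AS item
    FROM orders AS o, json_each(o.payload, '$.items') AS j
    ORDER BY o.id, item
""").fetchall()
print(rows)  # two source rows become three result rows
```

Note that two input rows produced three output rows — the row-count change the next myth buster warns about.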
5
Intermediate: Handling Missing or Optional JSON Fields
🤔
Concept: Learn how to safely query JSON fields that might not exist or be null.
JSON data often has missing keys or null values. Using functions like COALESCE or CASE WHEN in SQL helps provide defaults or handle missing data gracefully. This prevents errors or incorrect results when fields are absent.
Result
Your queries become robust and handle real-world messy JSON data without breaking.
Knowing how to handle missing fields prevents bugs and ensures your data models are reliable.
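A minimal sketch of the COALESCE pattern, using SQLite via Python's stdlib sqlite3 as the warehouse stand-in — extracting a missing key yields NULL, and COALESCE supplies a default (names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
# Record 2 has no "plan" key at all -- a typical real-world gap.
conn.execute("""INSERT INTO events VALUES
    (1, '{"user": "ada", "plan": "pro"}'),
    (2, '{"user": "bob"}')""")

# json_extract returns NULL for the missing key; COALESCE fills the default.
rows = conn.execute("""
    SELECT id,
           COALESCE(json_extract(payload, '$.plan'), 'free') AS plan
    FROM events
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 'pro'), (2, 'free')]
```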
6
Advanced: Optimizing JSON Transformations in dbt
🤔 Before reading on: do you think complex JSON queries always run fast in warehouses? Commit to your answer.
Concept: Learn techniques to improve performance when querying and transforming JSON data in dbt models.
Complex JSON extraction and flattening can slow queries. Techniques include selecting only needed fields, using CTEs to break down steps, caching intermediate results, and leveraging warehouse-specific JSON functions optimized for speed. dbt's incremental models can also help by processing only new data.
Result
Your dbt models run faster and scale better with large JSON datasets.
Performance tuning is crucial for production use because JSON queries can become expensive and slow without care.
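One of the techniques above — materializing parsed fields once so downstream queries never re-parse JSON — can be sketched with SQLite via Python's stdlib sqlite3. In dbt this corresponds to a staging model (or an incremental materialization); the names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_logs (id INTEGER, payload TEXT)")
conn.execute("""INSERT INTO raw_logs VALUES
    (1, '{"level": "info", "ms": 10}'),
    (2, '{"level": "warn", "ms": 20}'),
    (3, '{"level": "info", "ms": 30}')""")

# Materialize the parsed fields once (an intermediate model in dbt terms),
# so downstream queries read plain columns instead of re-parsing JSON.
conn.execute("""
    CREATE TABLE stg_logs AS
    SELECT id,
           json_extract(payload, '$.level') AS level,
           json_extract(payload, '$.ms')    AS ms
    FROM raw_logs
""")

# Downstream analytics touch only structured columns -- no JSON parsing here.
total = conn.execute("SELECT SUM(ms) FROM stg_logs").fetchone()[0]
print(total)  # 10 + 20 + 30 = 60
```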
7
Expert: Advanced Nested JSON and Schema Evolution
🤔 Before reading on: do you think JSON schemas stay the same over time in production? Commit to your answer.
Concept: Understand how to handle deeply nested JSON and evolving schemas that change structure over time.
In real systems, JSON data can have multiple nested levels and change shape as new fields are added or removed. Techniques include recursive flattening, dynamic SQL generation in dbt, and schema versioning strategies. Using dbt macros to adapt to schema changes helps maintain models without breaking.
Result
You can build resilient dbt models that handle complex, changing JSON data reliably.
Handling schema evolution and deep nesting is a key skill for maintaining long-term data pipelines with JSON.
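The recursive-flattening idea can be shown outside SQL with a short Python function — a sketch of the technique itself, not dbt's mechanism (in dbt this role is played by macros and dynamically generated SQL); the document and key names are made up:

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested dicts/lists into dotted-path keys.
    Useful when nesting is too deep or too variable to hard-code
    extraction paths."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}{i}."))
    else:
        out[prefix.rstrip(".")] = obj  # leaf value: record its full path
    return out

doc = json.loads('{"user": {"name": "Ada", "tags": ["a", "b"]}, "v": 2}')
flat = flatten(doc)
print(flat)  # {'user.name': 'Ada', 'user.tags.0': 'a', 'user.tags.1': 'b', 'v': 2}
```

New fields appearing in the source simply become new flattened keys, which is why this shape of transformation tolerates schema drift better than hand-written column lists.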
Under the Hood
When dbt runs a model with JSON data, it sends SQL queries to the data warehouse. The warehouse parses JSON strings into internal JSON types or text. Functions like JSON_EXTRACT navigate the JSON tree to return values. Flattening arrays uses set-returning functions that expand nested lists into rows. dbt compiles models into SQL and manages dependencies to build tables incrementally.
Why is it designed this way?
JSON was designed for flexible data exchange, not rigid tables. Warehouses added JSON support to handle modern data sources without forcing schema upfront. dbt builds on this by letting analysts write SQL transformations declaratively, making JSON handling accessible and maintainable. This design balances flexibility with the power of SQL analytics.
dbt Model Execution Flow
┌───────────────┐
│ dbt Compiles  │
│ SQL with JSON │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Data Warehouse│
│ Parses JSON   │
│ Executes SQL  │
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ JSON Functions│
│Extract/Flatten│
└──────┬────────┘
       │
       ▼
┌───────────────┐
│ Result Tables │
│ Structured    │
│ Data Ready    │
└───────────────┘
Myth Busters - 4 Common Misconceptions
Quick: Can you treat JSON data exactly like a normal SQL table column? Commit yes or no.
Common Belief: JSON data is just like any other column and can be queried directly without special functions.
Reality: JSON data is stored as text or a JSON type and requires special extraction or flattening functions to access nested values.
Why it matters: Treating JSON like normal columns leads to errors or empty results, blocking data analysis.
Quick: Does flattening a JSON array always keep the original row count? Commit yes or no.
Common Belief: Flattening JSON arrays does not change the number of rows in the result.
Reality: Flattening expands each array element into its own row, increasing the total row count.
Why it matters: Misunderstanding this causes incorrect joins or aggregations, leading to wrong analytics.
Quick: Is it safe to assume JSON schemas never change in production? Commit yes or no.
Common Belief: JSON data schemas are fixed and stable over time.
Reality: JSON schemas often evolve, with fields added, removed, or nested differently.
Why it matters: Ignoring schema changes breaks dbt models and causes pipeline failures.
Quick: Can you always use the same JSON extraction function across all warehouses? Commit yes or no.
Common Belief: JSON functions are standardized and work the same in every SQL warehouse.
Reality: Each warehouse has its own JSON syntax and functions, requiring different SQL code.
Why it matters: Assuming uniformity leads to broken queries and wasted debugging time.
Expert Zone
1
Some warehouses store JSON as native types with indexing, making queries faster, while others treat JSON as plain text, affecting performance.
2
Using dbt macros to abstract JSON extraction lets you write portable code across different warehouses with varying JSON syntax.
3
Incremental dbt models combined with JSON flattening can drastically reduce processing time by only handling new data.
When NOT to use
If JSON data is simple and flat, converting it to structured tables before dbt transformations might be easier. For extremely large or deeply nested JSON, consider specialized ETL tools or JSON-specific databases instead of pure SQL transformations.
Production Patterns
In production, teams use dbt to build layered models: raw JSON ingestion tables, intermediate flattened tables, and final analytics tables. They use tests to catch schema changes and macros to handle JSON differences across environments. Incremental runs and snapshots help manage large JSON datasets efficiently.
Connections
NoSQL Databases
Related data storage format
Understanding JSON handling in dbt helps bridge SQL analytics with NoSQL databases like MongoDB that store data as JSON documents.
API Data Integration
Builds on JSON data exchange
APIs often return JSON; mastering JSON transformations in dbt enables seamless integration of API data into analytics workflows.
XML Data Processing
Similar semi-structured data handling
Techniques for parsing and flattening JSON share concepts with XML processing, broadening skills in handling semi-structured data formats.
Common Pitfalls
#1: Trying to select nested JSON keys directly without extraction functions.
Wrong approach: SELECT json_column.address.city FROM table;
Correct approach: SELECT json_column->'address'->>'city' AS city FROM table;
Root cause: Misunderstanding that JSON is stored as text and requires functions to access nested data.
#2: Flattening JSON arrays but forgetting it multiplies rows, causing incorrect joins.
Wrong approach: SELECT * FROM table JOIN UNNEST(json_column.array) AS item ON table.id = item.id;
Correct approach: Use proper keys to join, or aggregate after flattening, to avoid row explosion.
Root cause: Not realizing flattening expands data and changes row counts.
#3: Ignoring missing JSON fields and not handling nulls, causing query errors.
Wrong approach: SELECT json_column->>'optional_field' AS field FROM table WHERE json_column->>'optional_field' = 'value';
Correct approach: SELECT COALESCE(json_column->>'optional_field', 'default') AS field FROM table WHERE COALESCE(json_column->>'optional_field', 'default') = 'value';
Root cause: Assuming all JSON keys exist in every record. (Note that most warehouses do not allow a SELECT alias like "field" in the WHERE clause, so the expression is repeated.)
Key Takeaways
JSON is a flexible, nested data format that requires special SQL functions to extract and flatten for analysis.
dbt enables transforming JSON data inside data warehouses by writing SQL models that unpack and organize JSON fields.
Flattening JSON arrays expands data into multiple rows, which is powerful but requires careful handling to avoid errors.
Handling missing or evolving JSON schemas is essential for building robust, production-ready data pipelines.
Performance optimization and abstraction with dbt macros make JSON transformations scalable and maintainable.