Reading JSON and nested data in Apache Spark - Step-by-Step Execution

Concept Flow - Reading JSON and nested data

Start: JSON file path

↓

Read JSON with Spark

↓

Create DataFrame

↓

Inspect schema

↓

Access nested fields

↓

Show data or use nested fields

The flow starts by reading a JSON file into a Spark DataFrame, then inspecting and accessing nested fields step-by-step.

Execution Sample

Apache Spark

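from pyspark.sql import SparkSession

# Reuse the active session or start a new one
spark = SparkSession.builder.getOrCreate()
# Read the file; by default Spark expects one JSON object per line
df = spark.read.json('data.json')
# Print the inferred schema, including nested struct fields
df.printSchema()
# Dot notation pulls 'city' out of the 'address' struct
df = df.select('name', 'address.city')
df.show()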

This code reads a JSON file with nested data, prints its schema, and shows selected nested fields.
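
The sample above assumes a file named data.json already exists. As a minimal sketch, the snippet below writes one using only the Python standard library. The records are hypothetical; only the field names (name, address.city, address.state) come from this walkthrough. It writes one JSON object per line, which is the layout Spark's JSON reader expects by default.

```python
import json

# Hypothetical sample records; only the field names come from the walkthrough.
records = [
    {"name": "Alice", "address": {"city": "Austin", "state": "TX"}},
    {"name": "Bob", "address": {"city": "Boston", "state": "MA"}},
]

# Write one complete JSON object per line (JSON Lines).
with open("data.json", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the file back to confirm that every line parses on its own.
with open("data.json", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]
```

With this file in the working directory, spark.read.json('data.json') should infer address as a struct with city and state subfields.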

Execution Table

Step	Action	Input/Condition	Result/Output
1	Create SparkSession	No input	SparkSession object created
2	Read JSON file	File path 'data.json'	DataFrame with nested schema created
3	Print schema	DataFrame	Schema shows fields: name (string), address (struct) with city, state
4	Select fields	Select 'name' and 'address.city'	DataFrame with columns: name, city
5	Show data	Selected columns	Table output with names and city values
6	End	All steps done	Data displayed, execution complete

💡 All JSON data read and nested fields accessed successfully.

Variable Tracker

Variable	Start	After Step 2	After Step 4	Final
spark	None	SparkSession object	SparkSession object	SparkSession object
df	None	DataFrame with nested JSON data	DataFrame with selected columns	DataFrame with selected columns

Key Moments - 3 Insights

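How do we know the structure of nested JSON fields?

df.printSchema() (step 3 in the execution table) prints every field with its type. A nested object such as address appears as a struct with its subfields listed beneath it.

How do we access a nested field like city inside address?

Use dot notation in select, as in step 4: 'address.city'. The resulting column is named after the leaf field, city.

What if the nested field does not exist in some rows?

As long as the field is part of the DataFrame's schema, Spark returns null for the rows that lack it.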

Visual Quiz - 3 Questions

Test your understanding

Look at the execution table. What does step 3 output?

A. A DataFrame with selected columns

B. The schema of the DataFrame showing nested fields

C. The JSON file content printed as text

D. An error message about missing fields

Concept Snapshot

Read JSON with spark.read.json('file')
Use df.printSchema() to see nested structure
Access nested fields with dot notation: df.select('parent.child')
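Display results with df.show(), or use the selected columns in later transformations
Spark returns null when a nested field in the schema is missing from a row; selecting a field that is absent from the schema raises an AnalysisException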

Full Transcript

This visual trace shows how to read JSON files with nested data in Apache Spark. First, a SparkSession is created. Then, spark.read.json reads the JSON file into a DataFrame. Using df.printSchema(), we inspect the nested structure, seeing fields like 'address' with subfields like 'city'. We select nested fields using dot notation, for example 'address.city', and display the data. The variable tracker shows how the DataFrame changes from raw JSON to selected columns. Key moments clarify how to understand schema and access nested data. The quizzes test understanding of schema output, nested field access, and schema differences for flat JSON. This step-by-step trace helps beginners see exactly how Spark handles nested JSON data.