
Reading JSON and nested data in Apache Spark - Step-by-Step Execution

Concept Flow - Reading JSON and nested data
Start: JSON file path
Read JSON with Spark
Create DataFrame
Inspect schema
Access nested fields
Show data or use nested fields
The flow starts by reading a JSON file into a Spark DataFrame, then inspecting and accessing nested fields step-by-step.
Execution Sample
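Python (PySpark)
from pyspark.sql import SparkSession

# Reuse the active SparkSession or start a new one
spark = SparkSession.builder.getOrCreate()
# Read the JSON file; Spark infers the schema, including nested structs
df = spark.read.json('data.json')
df.printSchema()
# Dot notation pulls the nested city field out of the address struct
df = df.select('name', 'address.city')
df.show()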
This code reads a JSON file with nested data, prints its schema, and shows selected nested fields.
Execution Table
Step | Action | Input/Condition | Result/Output
1 | Create SparkSession | No input | SparkSession object created
2 | Read JSON file | File path 'data.json' | DataFrame with nested schema created
3 | Print schema | DataFrame | Schema shows fields: name (string), address (struct) with city, state
4 | Select fields | Select 'name' and 'address.city' | DataFrame with columns: name, city
5 | Show data | Selected columns | Table output with names and city values
6 | End | All steps done | Data displayed, execution complete
💡 The JSON data was read and the nested field was accessed successfully.
Variable Tracker
Variable | Start | After Step 2 | After Step 4 | Final
spark | None | SparkSession object | SparkSession object | SparkSession object
df | None | DataFrame with nested JSON data | DataFrame with selected columns | DataFrame with selected columns
Key Moments - 3 Insights
How do we know the structure of nested JSON fields?
Call df.printSchema() (step 3 of the Execution Table). It prints every field with its type and indents the subfields of each struct under their parent.
How to access a nested field like city inside address?
Use dot notation such as 'address.city' in df.select(), as in step 4 of the Execution Table. The resulting column is named after the leaf field, city.
What if the nested field does not exist in some rows?
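If at least one record contains the field, it becomes part of the inferred schema, and Spark returns null for the rows that lack it, so the select runs without errors. If no record contains the field, it is absent from the schema and selecting it fails with an AnalysisException.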
Visual Quiz - 3 Questions
Test your understanding
Look at the Execution Table. What does step 3 output?
A. A DataFrame with selected columns
B. The schema of the DataFrame showing nested fields
C. The JSON file content printed as text
D. An error message about missing fields
💡 Hint: Refer to the Execution Table row for step 3, which describes the printSchema output.
At which step is the nested field 'address.city' accessed?
A. Step 4
B. Step 3
C. Step 2
D. Step 5
💡 Hint: Check the Execution Table row for step 4, where select('name', 'address.city') is called.
If the JSON file had no nested fields, how would the schema print differ?
A. It would show nested fields as arrays
B. It would show an error in printSchema
C. It would show only top-level fields without structs
D. It would not create a DataFrame
💡 Hint: Think about how printSchema shows nested structs vs flat fields (see step 3).
Concept Snapshot
Read JSON with spark.read.json('file')
Use df.printSchema() to see nested structure
Access nested fields with dot notation: df.select('parent.child')
Display the result with df.show() or keep working with the selected columns
Rows missing a nested field that exists in the schema return null for it
Full Transcript
This visual trace shows how to read JSON files with nested data in Apache Spark. First, a SparkSession is created. Then, spark.read.json reads the JSON file into a DataFrame. Using df.printSchema(), we inspect the nested structure, seeing fields like 'address' with subfields like 'city'. We select nested fields using dot notation, for example 'address.city', and display the data. The variable tracker shows how the DataFrame changes from raw JSON to selected columns. Key moments clarify how to understand schema and access nested data. The quizzes test understanding of schema output, nested field access, and schema differences for flat JSON. This step-by-step trace helps beginners see exactly how Spark handles nested JSON data.