Apache Spark · ~30 mins

Reading JSON and nested data in Apache Spark - Mini Project: Build & Apply

Reading JSON and Nested Data
📖 Scenario: You work as a data analyst for a company that collects customer information in JSON format. The data includes nested details like addresses and orders. You need to read this JSON data using Apache Spark and extract useful information.
🎯 Goal: Learn how to read JSON data with nested structures in Apache Spark and extract specific fields into a DataFrame.
📋 What You'll Learn
Read JSON data from a string using Spark
Access nested fields inside the JSON
Create a DataFrame with selected columns
Display the extracted data
💡 Why This Matters
🌍 Real World
Companies often receive data in JSON format with nested details. Being able to read and extract this data using Spark helps analyze customer information, logs, or events efficiently.
💼 Career
Data engineers and data scientists frequently work with JSON data in big data platforms like Spark. This skill is essential for preparing data for analysis or machine learning.
Step 1: Create JSON data string
Create a variable called json_data that contains this exact JSON string: [{"name": "Alice", "age": 30, "address": {"city": "New York", "zip": "10001"}}, {"name": "Bob", "age": 25, "address": {"city": "Los Angeles", "zip": "90001"}}]
Hint: Use single quotes around the whole string and double quotes inside, as JSON requires.
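One way to sketch this step in plain Python. The `json.loads` check at the end is an optional sanity test, not part of the exercise itself:

```python
import json

# Single quotes wrap the whole string, so the double quotes
# required by JSON need no escaping.
json_data = '[{"name": "Alice", "age": 30, "address": {"city": "New York", "zip": "10001"}}, {"name": "Bob", "age": 25, "address": {"city": "Los Angeles", "zip": "90001"}}]'

# Optional sanity check: the string parses as a JSON array of two records.
records = json.loads(json_data)
print(len(records))  # 2
```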

Step 2: Create Spark session
Create a Spark session variable called spark using SparkSession.builder.appName("JSONReader").getOrCreate()
Hint: Import SparkSession from pyspark.sql before creating the session.

Step 3: Read JSON string into DataFrame
Use spark.read.json() with spark.sparkContext.parallelize([json_data]) to create a DataFrame called df from the JSON string
Hint: Use parallelize to convert the JSON string into an RDD before reading.

Step 4: Select and show nested data
Select the columns name, age, and the nested field address.city from df into a new DataFrame called result. Then display the contents of result using result.show()
Hint: Use df.select("name", "age", "address.city") to access nested fields.