
Reading JSON and nested data in Apache Spark

Introduction

JSON files are widely used because they store data in a simple, structured way. Reading JSON lets us load that data into Spark for analysis. Nested data means some values live inside other values, like boxes inside boxes: a user record might contain an address object, which in turn contains a city and a zip code.
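To make "boxes inside boxes" concrete, here is a small illustration using Python's standard json module (plain Python, not Spark): the "address" field is itself an object with its own fields.

```python
import json

# A JSON record where "address" is an object nested inside the user record
raw = '{"name": "Alice", "age": 30, "address": {"city": "New York", "zip": "10001"}}'

record = json.loads(raw)          # parse the JSON string into a Python dict
print(record["name"])             # top-level field -> Alice
print(record["address"]["city"])  # nested field -> New York
```

Spark reads the same kind of structure, but infers a schema and exposes the nested fields as columns.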

You have data from web APIs that comes in JSON format.
You want to analyze user information where addresses are inside user details.
You receive logs or events stored as JSON with nested fields.
You want to load configuration files saved as JSON with multiple levels.
Syntax
Apache Spark
spark.read.json(path)

This reads a JSON file or folder into a Spark DataFrame.

Spark automatically infers the schema, representing nested JSON objects as struct columns.
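By default, spark.read.json expects the JSON Lines format: one complete JSON object per line (for files containing a single multi-line document, Spark provides a multiLine read option). A minimal sketch of what "one object per line" means, using only the standard library:

```python
import json

# JSON Lines: each line is a complete, self-contained JSON object
lines = [
    '{"name": "Alice", "address": {"city": "New York"}}',
    '{"name": "Bob", "address": {"city": "Los Angeles"}}',
]

# Conceptually, this is what Spark does per line (in parallel, with schema inference)
records = [json.loads(line) for line in lines]
print([r["address"]["city"] for r in records])  # -> ['New York', 'Los Angeles']
```

If your file is one large pretty-printed JSON document instead, read it with spark.read.option("multiLine", True).json(path).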

Examples
Reads a simple JSON file and shows the data.
Apache Spark
df = spark.read.json("data/simple.json")
df.show()
Reads a JSON file with nested data and prints the schema to see the structure.
Apache Spark
df = spark.read.json("data/nested.json")
df.printSchema()
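For nested JSON like the address example, printSchema() renders the structure as an indented tree, roughly like the sketch below (exact field names and types depend on your data; Spark lists inferred JSON fields alphabetically):

```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- zip: string (nullable = true)
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
```

The struct entry shows that address is a nested object whose fields can be reached with dot notation.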
Selects a top-level field and a nested field inside 'address'.
Apache Spark
df.select("name", "address.city").show()
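To see what a dot path such as address.city resolves to, here is a hypothetical helper (plain Python, not part of Spark's API) that walks a nested dict by a dotted key, mirroring what the dot notation does on a DataFrame column:

```python
def get_by_path(record, path):
    """Walk a nested dict following a dotted path such as 'address.city'."""
    value = record
    for key in path.split("."):
        value = value[key]  # descend one level per path segment
    return value

user = {"name": "Alice", "address": {"city": "New York", "zip": "10001"}}
print(get_by_path(user, "address.city"))  # -> New York
```

In Spark itself, df.select("address.city") pulls the city field out of the address struct column in exactly this one-level-per-segment fashion.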
Sample Program

This program creates a Spark session, writes sample JSON data with nested fields to a file, reads it back into a DataFrame, and shows the data and schema. It also selects a nested field to demonstrate accessing nested data.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadJSONExample").getOrCreate()

# Sample JSON data with nested structure
json_data = [
  '{"name": "Alice", "age": 30, "address": {"city": "New York", "zip": "10001"}}',
  '{"name": "Bob", "age": 25, "address": {"city": "Los Angeles", "zip": "90001"}}'
]

# Save sample data to a file
with open("sample.json", "w") as f:
    for line in json_data:
        f.write(line + "\n")

# Read the JSON file
df = spark.read.json("sample.json")

# Show the data
print("DataFrame content:")
df.show()

# Show the schema to understand nested data
print("DataFrame schema:")
df.printSchema()

# Select nested field
print("Selected nested field (address.city):")
df.select("name", "address.city").show()

spark.stop()
Output
Important Notes

Nested fields are accessed using dot notation, such as address.city.

Use printSchema() to understand the structure of nested JSON data.

Make sure the JSON file is accessible to Spark (local or distributed file system).

Summary

JSON files store data in a structured way, often with nested fields.

Spark can read JSON files directly and handle nested data automatically.

Use dot notation to select nested fields inside the DataFrame.