
Reading JSON and nested data in Apache Spark

Introduction

JSON files are widely used because they store data in a simple, structured way. Reading JSON lets us load that data into Spark for analysis. Nested data means some values live inside other values, like boxes inside boxes: a user record might contain an address object, which in turn contains a city and a zip code.
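To make "boxes inside boxes" concrete, here is a small illustration using Python's standard json module (plain Python, not Spark): the "address" field is itself an object with its own fields.

```python
import json

# A JSON record where "address" is an object nested inside the user record
raw = '{"name": "Alice", "age": 30, "address": {"city": "New York", "zip": "10001"}}'

record = json.loads(raw)          # parse the JSON string into a Python dict
print(record["name"])             # top-level field -> Alice
print(record["address"]["city"])  # nested field -> New York
```

Spark reads the same kind of structure, but infers a schema and exposes the nested fields as columns.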

You have data from web APIs that comes in JSON format.
You want to analyze user information where addresses are inside user details.
You receive logs or events stored as JSON with nested fields.
You want to load configuration files saved as JSON with multiple levels.
Syntax
Apache Spark
spark.read.json(path)

This reads a JSON file or folder into a Spark DataFrame.

Spark automatically infers the schema, representing nested JSON objects as struct columns.
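By default, spark.read.json expects the JSON Lines format: one complete JSON object per line (for files containing a single multi-line document, Spark provides a multiLine read option). A minimal sketch of what "one object per line" means, using only the standard library:

```python
import json

# JSON Lines: each line is a complete, self-contained JSON object
lines = [
    '{"name": "Alice", "address": {"city": "New York"}}',
    '{"name": "Bob", "address": {"city": "Los Angeles"}}',
]

# Conceptually, this is what Spark does per line (in parallel, with schema inference)
records = [json.loads(line) for line in lines]
print([r["address"]["city"] for r in records])  # -> ['New York', 'Los Angeles']
```

If your file is one large pretty-printed JSON document instead, read it with spark.read.option("multiLine", True).json(path).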

Examples
Reads a simple JSON file and shows the data.
Apache Spark
df = spark.read.json("data/simple.json")
df.show()
Reads a JSON file with nested data and prints the schema to see the structure.
Apache Spark
df = spark.read.json("data/nested.json")
df.printSchema()
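For nested JSON like the address example, printSchema() renders the structure as an indented tree, roughly like the sketch below (exact field names and types depend on your data; Spark lists inferred JSON fields alphabetically):

```
root
 |-- address: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- zip: string (nullable = true)
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
```

The struct entry shows that address is a nested object whose fields can be reached with dot notation.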
Selects a top-level field and a nested field inside 'address'.
Apache Spark
df.select("name", "address.city").show()
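To see what a dot path such as address.city resolves to, here is a hypothetical helper (plain Python, not part of Spark's API) that walks a nested dict by a dotted key, mirroring what the dot notation does on a DataFrame column:

```python
def get_by_path(record, path):
    """Walk a nested dict following a dotted path such as 'address.city'."""
    value = record
    for key in path.split("."):
        value = value[key]  # descend one level per path segment
    return value

user = {"name": "Alice", "address": {"city": "New York", "zip": "10001"}}
print(get_by_path(user, "address.city"))  # -> New York
```

In Spark itself, df.select("address.city") pulls the city field out of the address struct column in exactly this one-level-per-segment fashion.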
Sample Program

This program creates a Spark session, writes sample JSON data with nested fields to a file, reads it back into a DataFrame, and shows the data and schema. It also selects a nested field to demonstrate accessing nested data.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadJSONExample").getOrCreate()

# Sample JSON data with nested structure
json_data = [
  '{"name": "Alice", "age": 30, "address": {"city": "New York", "zip": "10001"}}',
  '{"name": "Bob", "age": 25, "address": {"city": "Los Angeles", "zip": "90001"}}'
]

# Save sample data to a file
with open("sample.json", "w") as f:
    for line in json_data:
        f.write(line + "\n")

# Read the JSON file
df = spark.read.json("sample.json")

# Show the data
print("DataFrame content:")
df.show()

# Show the schema to understand nested data
print("DataFrame schema:")
df.printSchema()

# Select nested field
print("Selected nested field (address.city):")
df.select("name", "address.city").show()

spark.stop()
Output
Important Notes

Nested fields are accessed using dot notation, such as address.city.

Use printSchema() to understand the structure of nested JSON data.

Make sure the JSON file is accessible to Spark (local or distributed file system).

Summary

JSON files store data in a structured way, often with nested fields.

Spark can read JSON files directly and handle nested data automatically.

Use dot notation to select nested fields inside the DataFrame.