Reading JSON and nested data in Apache Spark
We use JSON files because they store data in a simple, organized way. Reading JSON into Spark lets us analyze that data. Nested data means some data is stored inside other data, like boxes inside boxes.
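Before touching Spark, it helps to see what "boxes inside boxes" means in plain Python. This is a minimal sketch using the standard json module; the sample record mirrors the data used later in this tutorial:

```python
import json

# A JSON record with a nested "address" object -- a box inside a box
record = '{"name": "Alice", "age": 30, "address": {"city": "New York", "zip": "10001"}}'

data = json.loads(record)        # parse the JSON string into a Python dict
print(data["name"])              # top-level field
print(data["address"]["city"])   # nested field: open the outer box, then the inner one
```

Spark's dot notation (address.city, shown below) is the DataFrame equivalent of this two-step dictionary lookup.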
spark.read.json(path)
This reads a JSON file or folder into a Spark DataFrame.
Spark automatically detects nested structures and creates columns for them.
df = spark.read.json("data/simple.json")
df.show()

df = spark.read.json("data/nested.json")
df.printSchema()
df.select("name", "address.city").show()
This program creates a Spark session, writes sample JSON data with nested fields to a file, reads it back into a DataFrame, and shows the data and schema. It also selects a nested field to demonstrate accessing nested data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadJSONExample").getOrCreate()

# Sample JSON data with nested structure
json_data = [
    '{"name": "Alice", "age": 30, "address": {"city": "New York", "zip": "10001"}}',
    '{"name": "Bob", "age": 25, "address": {"city": "Los Angeles", "zip": "90001"}}'
]

# Save sample data to a file
with open("sample.json", "w") as f:
    for line in json_data:
        f.write(line + "\n")

# Read the JSON file
df = spark.read.json("sample.json")

# Show the data
print("DataFrame content:")
df.show()

# Show the schema to understand nested data
print("DataFrame schema:")
df.printSchema()

# Select nested field
print("Selected nested field (address.city):")
df.select("name", "address.city").show()

spark.stop()
Nested fields are accessed using dot notation like address.city.
Use printSchema() to understand the structure of nested JSON data.
Make sure the JSON file is accessible to Spark (local or distributed file system).
JSON files store data in a structured way, often with nested fields.
Spark can read JSON files directly and handle nested data automatically.
Use dot notation to select nested fields inside the DataFrame.