Apache-spark · How-To · Beginner · 3 min read

How to Read JSON Files in PySpark: Simple Guide

To read JSON files in PySpark, use the spark.read.json(path) method where path is the location of your JSON file or folder. This loads the JSON data into a DataFrame for easy processing and analysis.
📝

Syntax

The basic syntax to read JSON in PySpark is:

  • spark.read.json(path): Reads JSON data from the specified path.
  • path: Can be a single file or a directory containing JSON files.
  • The result is a DataFrame that you can use for further data processing.
python
df = spark.read.json("/path/to/jsonfile.json")
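By default, spark.read.json expects JSON Lines input: one complete JSON object per line. As a quick sketch (stdlib only, no Spark required), here is how you could generate a sample file in that layout to experiment with; the file name sample.json matches the example below, and the temp-directory path is just an illustration:

```python
import json
import tempfile
from pathlib import Path

rows = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
    {"name": "Charlie", "age": 35},
]

# Write one JSON object per line -- the layout spark.read.json expects by default
path = Path(tempfile.mkdtemp()) / "sample.json"
path.write_text("\n".join(json.dumps(r) for r in rows))

print(path.read_text().splitlines()[0])  # {"name": "Alice", "age": 25}
```

Pointing spark.read.json at the directory containing this file (instead of the file itself) would read it just the same, since the path argument accepts both.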
💻

Example

This example shows how to read a JSON file into a DataFrame and display its content.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadJSONExample").getOrCreate()

# Sample JSON file path
json_path = "sample.json"

# Read JSON file
df = spark.read.json(json_path)

# Show the DataFrame content
df.show()

# Stop Spark session
spark.stop()
Output
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
⚠️

Common Pitfalls

Common mistakes when reading JSON in PySpark include:

  • An incorrect or missing file path raises an error at read time (Spark reports that the path does not exist).
  • Feeding non-JSON files to read.json does not always fail loudly: in the default PERMISSIVE mode, malformed rows end up in a _corrupt_record column.
  • Nested JSON is loaded as struct columns; forgetting to flatten or select nested fields can leave columns missing or misnamed.
  • By default, Spark expects one JSON object per line (JSON Lines); for records that span multiple lines, pass multiLine=True to read.json.
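To see why the multiLine option matters, here is a minimal stdlib-only sketch (no Spark needed) of the two file layouts Spark distinguishes:

```python
import json

# JSON Lines: one complete object per line (Spark's default expectation)
json_lines = '{"name": "Alice", "age": 25}\n{"name": "Bob", "age": 30}'
records = [json.loads(line) for line in json_lines.splitlines()]

# Multiline JSON: a single record (here an array) spread over several lines;
# files shaped like this need multiLine=True in spark.read.json
multiline = """[
  {"name": "Alice", "age": 25},
  {"name": "Bob", "age": 30}
]"""
array = json.loads(multiline)

print(records == array)  # True
```

Both layouts encode the same data, but parsing the multiline file line by line would fail, which is exactly why Spark needs to be told up front.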

Example of reading multiline JSON correctly:

python
df = spark.read.json("multiline.json", multiLine=True)
📊

Quick Reference

Option            Description
path              Location of the JSON file or directory
multiLine=True    Read records that span multiple lines
schema            Explicit schema; skips inference and speeds up reads
mode              Error-handling mode: PERMISSIVE (default), DROPMALFORMED, FAILFAST
✅

Key Takeaways

  • Use spark.read.json(path) to load JSON files into a DataFrame.
  • Specify multiLine=True for JSON files with records spanning multiple lines.
  • Always check the file path and format to avoid read errors.
  • Define a schema when possible to improve performance and accuracy.
  • Use DataFrame methods like show() to inspect loaded JSON data.