Apache-spark · How-To · Beginner · 3 min read

How to Read JSON Files in PySpark: Simple Guide

To read JSON files in PySpark, use the spark.read.json(path) method where path is the location of your JSON file or folder. This loads the JSON data into a DataFrame for easy processing and analysis.
📝

Syntax

The basic syntax to read JSON in PySpark is:

  • spark.read.json(path): Reads JSON data from the specified path.
  • path: Can be a single file or a directory containing JSON files.
  • The result is a DataFrame that you can use for further data processing.
python
df = spark.read.json("/path/to/jsonfile.json")
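By default, spark.read.json expects JSON Lines input: one complete JSON object per line. As a quick sketch (stdlib only, no Spark required), here is how you could generate a sample file in that layout to experiment with; the file name sample.json matches the example below, and the temp-directory path is just an illustration:

```python
import json
import tempfile
from pathlib import Path

rows = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
    {"name": "Charlie", "age": 35},
]

# Write one JSON object per line -- the layout spark.read.json expects by default
path = Path(tempfile.mkdtemp()) / "sample.json"
path.write_text("\n".join(json.dumps(r) for r in rows))

print(path.read_text().splitlines()[0])  # {"name": "Alice", "age": 25}
```

Pointing spark.read.json at the directory containing this file (instead of the file itself) would read it just the same, since the path argument accepts both.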
💻

Example

This example shows how to read a JSON file into a DataFrame and display its content.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadJSONExample").getOrCreate()

# Sample JSON file path
json_path = "sample.json"

# Read JSON file
df = spark.read.json(json_path)

# Show the DataFrame content
df.show()

# Stop Spark session
spark.stop()
Output
+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+
⚠️

Common Pitfalls

Common mistakes when reading JSON in PySpark include:

  • An incorrect or missing file path raises an error at read time (Spark reports that the path does not exist).
  • Feeding non-JSON files to read.json does not always fail loudly: in the default PERMISSIVE mode, malformed rows end up in a _corrupt_record column.
  • Nested JSON is loaded as struct columns; forgetting to flatten or select nested fields can leave columns missing or misnamed.
  • By default, Spark expects one JSON object per line (JSON Lines); for records that span multiple lines, pass multiLine=True to read.json.
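To see why the multiLine option matters, here is a minimal stdlib-only sketch (no Spark needed) of the two file layouts Spark distinguishes:

```python
import json

# JSON Lines: one complete object per line (Spark's default expectation)
json_lines = '{"name": "Alice", "age": 25}\n{"name": "Bob", "age": 30}'
records = [json.loads(line) for line in json_lines.splitlines()]

# Multiline JSON: a single record (here an array) spread over several lines;
# files shaped like this need multiLine=True in spark.read.json
multiline = """[
  {"name": "Alice", "age": 25},
  {"name": "Bob", "age": 30}
]"""
array = json.loads(multiline)

print(records == array)  # True
```

Both layouts encode the same data, but parsing the multiline file line by line would fail, which is exactly why Spark needs to be told up front.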

Example of reading multiline JSON correctly:

python
df = spark.read.json("multiline.json", multiLine=True)
📊

Quick Reference

Option            Description
path              Location of the JSON file or directory
multiLine=True    Read records that span multiple lines
schema            Explicit schema; skips inference and speeds up reads
mode              Error-handling mode: PERMISSIVE (default), DROPMALFORMED, FAILFAST
✅

Key Takeaways

  • Use spark.read.json(path) to load JSON files into a DataFrame.
  • Specify multiLine=True for JSON files with records spanning multiple lines.
  • Always check the file path and format to avoid read errors.
  • Define a schema when possible to improve performance and accuracy.
  • Use DataFrame methods like show() to inspect loaded JSON data.