How to Read JSON Files in PySpark: Simple Guide
To read JSON files in PySpark, use the spark.read.json(path) method, where path is the location of your JSON file or folder. This loads the JSON data into a DataFrame for easy processing and analysis.
Syntax
The basic syntax to read JSON in PySpark is:
- spark.read.json(path): Reads JSON data from the specified path.
- path: Can be a single file or a directory containing JSON files.
- The result is a DataFrame that you can use for further data processing.
python
df = spark.read.json("/path/to/jsonfile.json")
Example
This example shows how to read a JSON file into a DataFrame and display its content.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadJSONExample").getOrCreate()

# Sample JSON file path
json_path = "sample.json"

# Read JSON file
df = spark.read.json(json_path)

# Show the DataFrame content
df.show()

# Stop Spark session
spark.stop()
Output
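+---+-------+
|age|   name|
+---+-------+
| 25|  Alice|
| 30|    Bob|
| 35|Charlie|
+---+-------+
When Spark infers a JSON schema, it sorts the field names alphabetically, so age appears before name.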
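The example above assumes that `sample.json` already exists. Here is a minimal sketch of how such a file could be created, assuming the JSON Lines layout (one complete object per line) that `spark.read.json` expects by default. The helper names `write_sample_json` and `read_sample` are hypothetical, and the three records simply mirror the rows in the output above.

```python
import json
import os
import tempfile

# Illustrative records mirroring the rows shown in the example output.
RECORDS = [
    {"name": "Alice", "age": 25},
    {"name": "Bob", "age": 30},
    {"name": "Charlie", "age": 35},
]


def write_sample_json(path, records=RECORDS):
    """Write one JSON object per line (JSON Lines), Spark's default layout."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    return path


def read_sample(path):
    """Load the file into a DataFrame; needs pyspark and a Java runtime."""
    # Imported here so the file-writing helper works without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ReadJSONExample").getOrCreate()
    df = spark.read.json(path)
    df.show()
    return df


# Hypothetical temporary location; any writable path works.
sample_path = write_sample_json(os.path.join(tempfile.mkdtemp(), "sample.json"))
```

Calling `read_sample(sample_path)` in an environment with PySpark installed should then load the three records into a DataFrame.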
Common Pitfalls
Common mistakes when reading JSON in PySpark include:
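- Using an incorrect file path: a path that does not exist raises an AnalysisException.
- Reading non-JSON files with read.json: in the default permissive mode the read does not fail; lines that cannot be parsed end up in a _corrupt_record column instead.
- Not handling nested JSON structures: nested objects load as struct columns and arrays as array columns, so their fields must be selected or flattened explicitly.
- Reading multiline JSON without multiLine=True: by default each line must hold one complete JSON object, so a record that spans several lines needs multiLine=True in read.json.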
Example of reading multiline JSON correctly:
python
df = spark.read.json("multiline.json", multiLine=True)
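The nested JSON pitfall listed above can be handled by selecting struct fields with dot notation and flattening arrays with `explode`. This is a minimal sketch, assuming hypothetical records with an `address` object and a `phones` array. The names `write_nested_json` and `flatten_nested` are illustrative, not part of PySpark.

```python
import json
import os
import tempfile

# Hypothetical nested records, used only for illustration.
NESTED_RECORDS = [
    {"name": "Alice", "address": {"city": "Paris"}, "phones": ["111", "222"]},
    {"name": "Bob", "address": {"city": "Rome"}, "phones": ["333"]},
]


def write_nested_json(path):
    """Write the nested records as JSON Lines (one object per line)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in NESTED_RECORDS:
            f.write(json.dumps(record) + "\n")
    return path


def flatten_nested(spark, path):
    """Select a nested field and turn each array element into its own row."""
    # Imported here so the file-writing helper works without Spark installed.
    from pyspark.sql.functions import col, explode

    df = spark.read.json(path)
    return df.select(
        col("name"),
        col("address.city").alias("city"),      # struct field via dot notation
        explode(col("phones")).alias("phone"),  # one output row per array element
    )


nested_path = write_nested_json(os.path.join(tempfile.mkdtemp(), "nested.json"))
```

With an active session, `flatten_nested(spark, nested_path).show()` would list one row per phone number. Calling `printSchema()` on the loaded DataFrame is a quick way to check how Spark inferred the nesting.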
Quick Reference
| Option | Description |
|---|---|
| path | Location of JSON file or directory |
| multiLine=True | Set when a single JSON record spans multiple lines (the default expects one record per line) |
| schema | Define schema to optimize reading and avoid inference |
| mode | Error handling mode for malformed records: PERMISSIVE (default), DROPMALFORMED, or FAILFAST |
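The `schema` and `mode` options from the table can be combined in a single call. This sketch assumes the `name` and `age` fields from the earlier example and passes the schema as a DDL-formatted string, which `spark.read.json` accepts in place of a `StructType`. The wrapper name `read_with_schema` and the up-front mode check are illustrative choices, not part of PySpark.

```python
# Parse modes supported by Spark's JSON reader.
VALID_MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")

# Assumed schema for the name/age sample data, written as a DDL string.
PERSON_SCHEMA = "name STRING, age BIGINT"


def read_with_schema(spark, path, mode="PERMISSIVE"):
    """Read JSON with an explicit schema and a chosen error-handling mode."""
    normalized = mode.upper()
    if normalized not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    # An explicit schema skips the inference pass over the data.
    return spark.read.json(path, schema=PERSON_SCHEMA, mode=normalized)
```

For instance, `read_with_schema(spark, "sample.json", mode="FAILFAST")` would raise an error on the first malformed record instead of silently keeping or dropping it.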
Key Takeaways
Use spark.read.json(path) to load JSON files into a DataFrame.
Specify multiLine=True for JSON files with records spanning multiple lines.
Always check the file path and format to avoid read errors.
Define a schema when possible to improve performance and accuracy.
Use DataFrame methods like show() to inspect loaded JSON data.