How to Create DataFrame from JSON in PySpark
To create a DataFrame from JSON in PySpark, use the SparkSession.read.json() method, passing a JSON file path or an RDD of JSON strings. This method automatically infers the schema and loads the data into a DataFrame for easy processing.

Syntax
The basic syntax to create a DataFrame from JSON in PySpark is:
- spark.read.json(path): Reads JSON data from a file or directory.
- spark.read.json(rdd_of_json_strings): Reads JSON data from an RDD of JSON strings.
- spark.read.json(list_of_json_strings): Reads JSON data from a list of JSON strings (first convert the list to an RDD with parallelize).
Here, spark is a SparkSession object.
```python
df = spark.read.json("path/to/json/file.json")
```

Example
This example shows how to create a DataFrame from a JSON string list and display its content.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("JsonExample").getOrCreate()

json_data = [
    '{"name":"Alice","age":30}',
    '{"name":"Bob","age":25}',
    '{"name":"Charlie","age":35}'
]

df = spark.read.json(spark.sparkContext.parallelize(json_data))
df.show()
```
Output
+---+-------+
|age|   name|
+---+-------+
| 30|  Alice|
| 25|    Bob|
| 35|Charlie|
+---+-------+

Note that the columns appear in alphabetical order (age before name): when PySpark infers a schema from JSON, it sorts the field names alphabetically.
Common Pitfalls
Common mistakes when creating DataFrames from JSON in PySpark include:
- Passing invalid JSON strings or malformed files, which causes errors.
- Not using spark.sparkContext.parallelize() when reading JSON from a Python list, which causes type errors.
- Assuming the inferred schema is always correct; for complex JSON you may need to specify the schema explicitly.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local").appName("JsonExample").getOrCreate()

# Wrong: passing the list directly without parallelize
json_data = ['{"name":"Alice","age":30}']
# This will raise an error:
# df = spark.read.json(json_data)  # WRONG

# Correct way:
df = spark.read.json(spark.sparkContext.parallelize(json_data))
df.show()
Output
+---+-----+
|age| name|
+---+-----+
| 30|Alice|
+---+-----+
Quick Reference
| Method | Description |
|---|---|
| spark.read.json(path) | Read JSON data from file or directory path |
| spark.read.json(rdd_of_json_strings) | Read JSON data from RDD of JSON strings |
| spark.read.json(list_of_json_strings) | Read JSON data from list of JSON strings (use parallelize) |
| df.show() | Display DataFrame content |
| df.printSchema() | Print inferred schema of DataFrame |
Key Takeaways
Use spark.read.json() to load JSON data into a DataFrame in PySpark.
When reading JSON from a Python list, convert it to an RDD using sparkContext.parallelize().
PySpark infers the schema automatically, but for complex JSON you may need to specify it explicitly.
Malformed JSON or wrong input types cause errors, so validate your JSON data.
Use df.show() to quickly view the loaded DataFrame content.