How to Read Parquet Files in PySpark: Simple Guide
To read a Parquet file in PySpark, use the spark.read.parquet(path) method, where path is the file location. This loads the Parquet data into a DataFrame for easy processing.
Syntax
The basic syntax to read a Parquet file in PySpark is:
- spark.read.parquet(path): Reads the Parquet file from the specified path.
- path: The location of the Parquet file or directory.
- The method returns a DataFrame containing the data.
```python
df = spark.read.parquet("/path/to/parquet/file")
```
Example
This example shows how to create a Spark session, read a Parquet file, and display its content.
```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("ReadParquetExample").getOrCreate()

# Read Parquet file
parquet_path = "example.parquet"
df = spark.read.parquet(parquet_path)

# Show data
print("Data from Parquet file:")
df.show()

# Stop Spark session
spark.stop()
```
Output
Data from Parquet file:
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
+---+-----+
Common Pitfalls
Common mistakes when reading Parquet files in PySpark include:
- Using an incorrect file path or a missing file causes errors.
- Trying to read a non-Parquet file with read.parquet() will fail.
- Not having an active Spark session before reading the file.
- Confusing a path to a directory of multiple Parquet files with a path to a single file.
Always verify the path and ensure the file format is Parquet.
```python
from pyspark.sql import SparkSession

# Wrong way: no Spark session
# df = spark.read.parquet("file.parquet")  # Fails because spark is not defined

# Right way: create the session first
spark = SparkSession.builder.appName("FixExample").getOrCreate()
df = spark.read.parquet("file.parquet")
df.show()
spark.stop()
Quick Reference
| Method | Description |
|---|---|
| spark.read.parquet(path) | Read Parquet file(s) into a DataFrame |
| df.show() | Display the first 20 rows of the DataFrame |
| spark.stop() | Stop the Spark session to free resources |
Key Takeaways
- Use spark.read.parquet(path) to load Parquet files into a DataFrame.
- Always create a Spark session before reading files.
- Check the file path and format to avoid errors.
- Use df.show() to quickly view the loaded data.
- Stop the Spark session when you are done to release resources.