
How to Read Parquet Files in PySpark: Simple Guide

To read a Parquet file in PySpark, use the spark.read.parquet(path) method where path is the file location. This loads the Parquet data into a DataFrame for easy processing.

Syntax

The basic syntax to read a Parquet file in PySpark is:

  • spark.read.parquet(path): Reads the Parquet file from the specified path.
  • path: The location of the Parquet file or directory.
  • The method returns a DataFrame containing the data.
```python
df = spark.read.parquet("/path/to/parquet/file")
```

Example

This example shows how to create a Spark session, read a Parquet file, and display its content.

```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("ReadParquetExample").getOrCreate()

# Read Parquet file
parquet_path = "example.parquet"
df = spark.read.parquet(parquet_path)

# Show data
print("Data from Parquet file:")
df.show()

# Stop Spark session
spark.stop()
```

Output

```
Data from Parquet file:
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
```

Common Pitfalls

Common mistakes when reading Parquet files in PySpark include:

  • Using an incorrect or nonexistent file path raises an error.
  • Trying to read a non-Parquet file with read.parquet() will fail.
  • Forgetting to create an active Spark session before reading the file.
  • Confusing a single Parquet file with a directory that contains multiple Parquet part files.

Always verify the path and ensure the file format is Parquet.

```python
from pyspark.sql import SparkSession

# Wrong way: no Spark session
# df = spark.read.parquet("file.parquet")  # This will fail because spark is not defined

# Right way: create the session first
spark = SparkSession.builder.appName("FixExample").getOrCreate()
df = spark.read.parquet("file.parquet")
df.show()
spark.stop()
```

Quick Reference

| Method | Description |
| --- | --- |
| spark.read.parquet(path) | Read Parquet file(s) into a DataFrame |
| df.show() | Display the first 20 rows of the DataFrame |
| spark.stop() | Stop the Spark session to free resources |

Key Takeaways

  • Use spark.read.parquet(path) to load Parquet files into a DataFrame.
  • Always create a Spark session before reading files.
  • Check the file path and format to avoid errors.
  • Use df.show() to quickly view the loaded data.
  • Stop the Spark session when you are done to release resources.