
How to Read Parquet Files in PySpark: Simple Guide

To read a Parquet file in PySpark, use the spark.read.parquet(path) method where path is the file location. This loads the Parquet data into a DataFrame for easy processing.

Syntax

The basic syntax to read a Parquet file in PySpark is:

  • spark.read.parquet(path): Reads the Parquet file from the specified path.
  • path: The location of the Parquet file or directory.
  • The method returns a DataFrame containing the data.
```python
df = spark.read.parquet("/path/to/parquet/file")
```

Example

This example shows how to create a Spark session, read a Parquet file, and display its content.

```python
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("ReadParquetExample").getOrCreate()

# Read Parquet file
parquet_path = "example.parquet"
df = spark.read.parquet(parquet_path)

# Show data
print("Data from Parquet file:")
df.show()

# Stop Spark session
spark.stop()
```

Output

```
Data from Parquet file:
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
+---+-----+
```

Common Pitfalls

Common mistakes when reading Parquet files in PySpark include:

  • Using an incorrect or nonexistent file path raises an error.
  • Trying to read a non-Parquet file with read.parquet() will fail.
  • Forgetting to create an active Spark session before reading the file.
  • Confusing a single Parquet file with a directory that contains multiple Parquet part files.

Always verify the path and ensure the file format is Parquet.

```python
from pyspark.sql import SparkSession

# Wrong way: no Spark session
# df = spark.read.parquet("file.parquet")  # This will fail because spark is not defined

# Right way: create the session first
spark = SparkSession.builder.appName("FixExample").getOrCreate()
df = spark.read.parquet("file.parquet")
df.show()
spark.stop()
```

Quick Reference

| Method | Description |
| --- | --- |
| spark.read.parquet(path) | Read Parquet file(s) into a DataFrame |
| df.show() | Display the first 20 rows of the DataFrame |
| spark.stop() | Stop the Spark session to free resources |

Key Takeaways

  • Use spark.read.parquet(path) to load Parquet files into a DataFrame.
  • Always create a Spark session before reading files.
  • Check the file path and format to avoid errors.
  • Use df.show() to quickly view the loaded data.
  • Stop the Spark session when you are done to release resources.