
How to Write Parquet Files in PySpark: Simple Guide

To write data as a Parquet file in PySpark, use the DataFrame.write.parquet(path) method. This saves the DataFrame in Parquet, an efficient columnar format, at the specified path. You can also set options such as mode to control how existing data at the path is handled.

Syntax

The basic syntax to write a DataFrame as a Parquet file in PySpark is:

  • DataFrame.write.parquet(path): Saves the DataFrame to the given path in Parquet format.
  • mode (optional): Controls how to handle existing data at the path. Common modes are overwrite, append, ignore, and error (the default, which fails if the path already exists).
  • partitionBy (optional): Allows partitioning the data by one or more columns for faster queries.
python
df.write.mode('overwrite').parquet('/path/to/save')

Example

This example shows how to create a simple DataFrame and write it as a Parquet file. It demonstrates saving with overwrite mode and reading back the data.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ParquetExample').getOrCreate()

# Create sample data
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
columns = ['id', 'name']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Write DataFrame as Parquet file
output_path = '/tmp/example_parquet'
df.write.mode('overwrite').parquet(output_path)

# Read back the Parquet file
df_read = spark.read.parquet(output_path)
df_read.show()

spark.stop()
Output
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Cathy|
+---+-----+

Common Pitfalls

  • Path already exists: Writing without mode='overwrite' will cause an error if the path exists.
  • Incorrect path format: Use absolute paths or supported file system URIs (e.g., hdfs://, s3://).
  • Schema mismatch on append: Appending data with a different schema causes errors.
  • Not stopping SparkSession: Always stop the SparkSession after use to free resources.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PitfallExample').getOrCreate()

data1 = [(1, 'Alice')]
data2 = [(2, 'Bob', 30)]  # Different schema

df1 = spark.createDataFrame(data1, ['id', 'name'])
df2 = spark.createDataFrame(data2, ['id', 'name', 'age'])

output_path = '/tmp/pitfall_parquet'

# Write the first DataFrame
df1.write.mode('overwrite').parquet(output_path)

# Wrong: appending DataFrame with different schema causes error
# df2.write.mode('append').parquet(output_path)  # This will fail

# Right: ensure schemas match before appending

spark.stop()

Quick Reference

Here is a quick summary of common write.parquet options:

Option      | Description                                  | Example
path        | Location to save the Parquet files           | '/data/output'
mode        | Write mode: overwrite, append, ignore, error | 'overwrite'
partitionBy | Columns to partition data by                 | partitionBy('year', 'month')
compression | Compression codec: snappy, gzip, none        | option('compression', 'snappy')

Key Takeaways

  • Use DataFrame.write.parquet(path) to save data in Parquet format efficiently.
  • Specify mode='overwrite' to replace existing files without errors.
  • Partition data by columns for faster queries using partitionBy().
  • Ensure schema consistency when appending data to Parquet files.
  • Always stop the SparkSession after your job to release resources.