How to Write Parquet Files in PySpark: Simple Guide
To write data as a Parquet file in PySpark, use the `DataFrame.write.parquet(path)` method. This saves the DataFrame in an efficient columnar format at the specified path. You can also set options such as `mode` to control overwrite behavior.
Syntax
The basic syntax to write a DataFrame as a Parquet file in PySpark is:

- `DataFrame.write.parquet(path)`: Saves the DataFrame to the given `path` in Parquet format.
- `mode` (optional): Controls how to handle existing data. Common modes are `overwrite`, `append`, `ignore`, and `error` (the default).
- `partitionBy` (optional): Partitions the data by one or more columns for faster queries.

```python
df.write.mode('overwrite').parquet('/path/to/save')
```
Example
This example shows how to create a simple DataFrame and write it as a Parquet file. It demonstrates saving with overwrite mode and reading back the data.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ParquetExample').getOrCreate()

# Create sample data
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
columns = ['id', 'name']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Write DataFrame as Parquet file
output_path = '/tmp/example_parquet'
df.write.mode('overwrite').parquet(output_path)

# Read back the Parquet file
df_read = spark.read.parquet(output_path)
df_read.show()

spark.stop()
```
Output
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
| 3|Cathy|
+---+-----+
Common Pitfalls
- Path already exists: Writing without `mode='overwrite'` causes an error if the path already exists.
- Incorrect path format: Use absolute paths or supported file system URIs (e.g., `hdfs://`, `s3://`).
- Schema mismatch on append: Appending data with a different schema leaves the directory with inconsistent files, which can break or confuse later reads (you may need `mergeSchema` when reading).
- Not stopping SparkSession: Always stop the SparkSession after use to free resources.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PitfallExample').getOrCreate()

data1 = [(1, 'Alice')]
data2 = [(2, 'Bob', 30)]  # Different schema

df1 = spark.createDataFrame(data1, ['id', 'name'])
df2 = spark.createDataFrame(data2, ['id', 'name', 'age'])

output_path = '/tmp/pitfall_parquet'

# Write first DataFrame
df1.write.mode('overwrite').parquet(output_path)

# Wrong: appending a DataFrame with a mismatched schema produces
# inconsistent files in the same directory and breaks later reads
# df2.write.mode('append').parquet(output_path)

# Right: ensure schemas match before appending
spark.stop()
```
Quick Reference
Here is a quick summary of common `write.parquet` options:
| Option | Description | Example |
|---|---|---|
| path | Location to save the Parquet files | '/data/output' |
| mode | Write mode: overwrite, append, ignore, error | 'overwrite' |
| partitionBy | Columns to partition data by | partitionBy('year', 'month') |
| compression | Compression codec: snappy, gzip, none | option('compression', 'snappy') |
Key Takeaways
- Use `DataFrame.write.parquet(path)` to save data in Parquet format efficiently.
- Specify `mode='overwrite'` to replace existing files without errors.
- Partition data by columns for faster queries using `partitionBy()`.
- Ensure schema consistency when appending data to Parquet files.
- Always stop the SparkSession after your job to release resources.