How to Use partitionBy for Write in PySpark
In PySpark, use partitionBy with the write method to save data partitioned by one or more columns. This creates a separate folder for each unique value in the specified columns, making the data easier to manage and query.
Syntax
The partitionBy method is used with the DataFrameWriter to specify columns for partitioning data when writing. It is chained before the save or saveAsTable method.
```
df.write.partitionBy(<col1>, <col2>, ...).format(<format>).save(<path>)
```
- partitionBy: one or more column names to partition the data by
- format: output file format, such as 'parquet' or 'csv'
- save: path where the partitioned data is written

```python
df.write.partitionBy('column1', 'column2').format('parquet').save('/path/to/output')
```
Example
This example shows how to write a PySpark DataFrame partitioned by the 'department' column into Parquet files. Each department's data will be saved in a separate folder.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionByExample').getOrCreate()

# Sample data
data = [
    (1, 'Alice', 'HR'),
    (2, 'Bob', 'IT'),
    (3, 'Cathy', 'HR'),
    (4, 'David', 'Finance'),
    (5, 'Eva', 'IT'),
]
columns = ['id', 'name', 'department']
df = spark.createDataFrame(data, columns)

# Write data partitioned by 'department'
df.write.partitionBy('department').parquet('/tmp/employee_partitioned')

spark.stop()
```
Output
Data saved to /tmp/employee_partitioned with folders:
/tmp/employee_partitioned/department=Finance/
/tmp/employee_partitioned/department=HR/
/tmp/employee_partitioned/department=IT/
Common Pitfalls
- Not specifying partitionBy before save or parquet causes data to be saved without partitions.
- Partitioning by columns with high cardinality (many unique values) can create too many small files, hurting performance.
- Using columns with null values for partitioning sends those rows to a placeholder folder named column=__HIVE_DEFAULT_PARTITION__, which might be unexpected.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WrongPartition').getOrCreate()

# Sample data (note the null department)
data = [(1, 'Alice', 'HR'), (2, 'Bob', None)]
columns = ['id', 'name', 'department']
df = spark.createDataFrame(data, columns)

# Wrong: no partitionBy, so the output is not partitioned
df.write.save('/tmp/wrong_partition')

# Right: chain partitionBy before save
# df.write.partitionBy('department').save('/tmp/right_partition')

spark.stop()
```
Quick Reference
Use this quick guide when writing partitioned data in PySpark:
| Method | Description | Example |
|---|---|---|
| partitionBy | Specify columns to partition data by | df.write.partitionBy('col').parquet(path) |
| format | Set output file format | df.write.format('csv').save(path) |
| save | Save data to path | df.write.save('/path/to/save') |
| parquet | Save data in Parquet format | df.write.parquet('/path') |
Key Takeaways
- Use partitionBy before save or parquet to organize output by column values.
- Partitioning creates a folder for each unique value in the specified columns.
- Avoid partitioning on columns with too many unique values to prevent many small files.
- Null values in a partition column are written to the placeholder folder __HIVE_DEFAULT_PARTITION__.
- Always chain partitionBy before the final write action (save, parquet, or saveAsTable) for it to take effect.