
How to Use partitionBy for Write in PySpark

In PySpark, use partitionBy on the DataFrame writer to save data partitioned by one or more columns. This creates a separate folder for each unique value (or combination of values) of those columns, making the data easier to manage and query.

Syntax

The partitionBy method is used with the DataFrameWriter to specify columns for partitioning data when writing. It is chained before the save or saveAsTable method.

  • df.write.partitionBy(<col1>, <col2>, ...).format(<format>).save(<path>)
  • partitionBy: one or more column names to partition data by
  • format: output file format like 'parquet', 'csv', etc.
  • save: path to save the partitioned data
python
df.write.partitionBy('column1', 'column2').format('parquet').save('/path/to/output')
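The same writer also works with saveAsTable when you want a managed table instead of raw files. A minimal sketch, assuming a session with a configured catalog/warehouse and a hypothetical table name employees_by_dept:
python
# Hypothetical table name; requires a session with a configured catalog/warehouse
df.write.partitionBy('department').format('parquet').saveAsTable('employees_by_dept')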

Example

This example shows how to write a PySpark DataFrame partitioned by the 'department' column into Parquet files. Each department's data will be saved in a separate folder.

python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionByExample').getOrCreate()

# Sample data
data = [
    (1, 'Alice', 'HR'),
    (2, 'Bob', 'IT'),
    (3, 'Cathy', 'HR'),
    (4, 'David', 'Finance'),
    (5, 'Eva', 'IT')
]

columns = ['id', 'name', 'department']
df = spark.createDataFrame(data, columns)

# Write data partitioned by 'department'
df.write.partitionBy('department').parquet('/tmp/employee_partitioned')

spark.stop()
Output
Data saved to /tmp/employee_partitioned with folders:
/tmp/employee_partitioned/department=Finance/
/tmp/employee_partitioned/department=HR/
/tmp/employee_partitioned/department=IT/
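Once the data is written this way, reading it back is straightforward: Spark recovers the partition column from the folder names, and filtering on it lets Spark skip folders it does not need (partition pruning). A minimal sketch reusing the /tmp/employee_partitioned path from above:
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ReadPartitionedExample').getOrCreate()

# 'department' is reconstructed from the department=... folder names
emp = spark.read.parquet('/tmp/employee_partitioned')

# Filtering on the partition column lets Spark skip the other folders
emp.filter(emp.department == 'HR').show()

spark.stop()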

Common Pitfalls

  • Not specifying partitionBy before save or parquet causes data to be saved without partitions.
  • Partitioning by columns with high cardinality (many unique values) can create too many small files, hurting performance.
  • Partitioning on columns that contain null values writes those rows to a default partition folder (typically department=__HIVE_DEFAULT_PARTITION__ rather than department=null), which might be unexpected.
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('WrongPartition').getOrCreate()

# Sample data
data = [(1, 'Alice', 'HR'), (2, 'Bob', None)]
columns = ['id', 'name', 'department']
df = spark.createDataFrame(data, columns)

# Wrong: no partitionBy on the writer, so data is saved without partition folders
df.write.save('/tmp/wrong_partition')

# Right: chain partitionBy before save
# (note: Bob's None department would land in a default folder such as department=__HIVE_DEFAULT_PARTITION__)
# df.write.partitionBy('department').save('/tmp/right_partition')

spark.stop()
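One common way to soften the small-files pitfall is to repartition by the partition column before writing, so each value is handled by a single task and each folder gets one file. A minimal sketch (the output path and app name are illustrative):
python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('CompactPartitionWrite').getOrCreate()

data = [(1, 'Alice', 'HR'), (2, 'Bob', 'IT'), (3, 'Cathy', 'HR')]
df = spark.createDataFrame(data, ['id', 'name', 'department'])

# Repartition by the partition column first: each department is written by one task,
# so each department=... folder contains a single file instead of many small ones
df.repartition('department') \
    .write.partitionBy('department') \
    .parquet('/tmp/employee_partitioned_compact')

spark.stop()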

Quick Reference

Use this quick guide when writing partitioned data in PySpark:

Method      | Description                           | Example
partitionBy | Specify columns to partition data by  | df.write.partitionBy('col').parquet(path)
format      | Set the output file format            | df.write.format('csv').save(path)
save        | Save data to a path                   | df.write.save('/path/to/save')
parquet     | Save data in Parquet format           | df.write.parquet('/path')
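These methods compose on the same writer. A minimal sketch combining them for a partitioned CSV write (the output path is illustrative):
python
# Partition by 'department' and write CSV files to an illustrative path
df.write.partitionBy('department').format('csv').save('/tmp/employee_csv')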

Key Takeaways

  • Chain partitionBy on the writer before save, parquet, or saveAsTable so it takes effect.
  • Partitioning creates a folder for each unique value of the specified columns.
  • Avoid partitioning on columns with too many unique values to prevent many small files.
  • Null values in partition columns end up in a default folder (typically department=__HIVE_DEFAULT_PARTITION__).