How to Use partitionBy in PySpark: Syntax and Examples
In PySpark, partitionBy is used to divide data into separate partitions based on one or more columns, most commonly when writing data to storage or when defining window functions. It helps organize data efficiently by grouping rows with the same column values together.
Syntax
The partitionBy method is used in two main contexts in PySpark:
- DataFrameWriter: when saving data, partitionBy specifies the columns whose unique values split the output files into folders.
- Window functions: partitionBy defines how rows are grouped before window calculations are applied.
Basic syntax for writing data:
df.write.partitionBy(<column_names>).format(<format>).save(<path>)
Basic syntax for window functions:
from pyspark.sql.window import Window
window_spec = Window.partitionBy(<column_names>).orderBy(<column_name>)
Example
This example numbers the rows within each category using a window specification:

```python
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

spark = SparkSession.builder.appName('PartitionByExample').getOrCreate()

# Example DataFrame
data = [(1, 'A', 100), (2, 'A', 200), (3, 'B', 300), (4, 'B', 400)]
columns = ['id', 'category', 'value']
df = spark.createDataFrame(data, columns)

# Window partitionBy example
window_spec = Window.partitionBy('category').orderBy('value')
df_with_row_num = df.withColumn('row_num', row_number().over(window_spec))
df_with_row_num.show()
```
Output
+---+--------+-----+-------+
| id|category|value|row_num|
+---+--------+-----+-------+
| 1| A| 100| 1|
| 2| A| 200| 2|
| 3| B| 300| 1|
| 4| B| 400| 2|
+---+--------+-----+-------+
Example
This example shows how to use partitionBy when writing a DataFrame to disk. It saves the data into folders based on the category column, so each category's data is stored separately.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('PartitionByWriteExample').getOrCreate()

data = [(1, 'A', 100), (2, 'A', 200), (3, 'B', 300), (4, 'B', 400)]
columns = ['id', 'category', 'value']
df = spark.createDataFrame(data, columns)

# Write DataFrame partitioned by 'category' column
output_path = '/tmp/partitioned_data'
df.write.mode('overwrite').partitionBy('category').parquet(output_path)

# Read back the data to verify
read_df = spark.read.parquet(output_path)
read_df.show()
```
Output
Note that when reading partitioned data back, Spark appends the partition column at the end of the schema, and row order is not guaranteed:
+---+-----+--------+
| id|value|category|
+---+-----+--------+
|  1|  100|       A|
|  2|  200|       A|
|  3|  300|       B|
|  4|  400|       B|
+---+-----+--------+
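On disk, the write above produces one Hive-style subfolder per distinct value of the partition column, named column=value, with the partition column itself omitted from the data files (it is recovered from the folder name on read). The following pure-Python sketch simulates that layout without running Spark, using plain CSV part files as stand-ins for Spark's Parquet part files:

```python
import tempfile
from pathlib import Path

# Simulate (without Spark) the Hive-style directory layout that
# df.write.partitionBy('category') produces: one subdirectory per
# distinct value, named column=value, with the partition column
# stored in the path rather than in the data files.
rows = [(1, 'A', 100), (2, 'A', 200), (3, 'B', 300), (4, 'B', 400)]

out = Path(tempfile.mkdtemp())
for id_, category, value in rows:
    part_dir = out / f'category={category}'   # folder per partition value
    part_dir.mkdir(exist_ok=True)
    # Spark writes Parquet part files; plain CSV lines stand in here.
    with open(part_dir / 'part-00000.csv', 'a') as f:
        f.write(f'{id_},{value}\n')           # 'category' lives in the path

print(sorted(p.name for p in out.iterdir()))
# → ['category=A', 'category=B']
```

This folder-per-value layout is what lets Spark skip irrelevant partitions (partition pruning) when a query filters on the partition column.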
Common Pitfalls
Common mistakes when using partitionBy include:
- Calling partitionBy() with no columns when writing: rather than raising an error, Spark simply writes unpartitioned output, silently defeating the purpose of partitioning.
- Expecting partitionBy to reorder data inside the DataFrame; it only affects output file organization or window grouping.
- Not using mode('overwrite') when writing partitioned data, which causes the write to fail if the path already exists (the default mode is errorifexists).
Example of wrong and right usage:

```python
# Wrong: no columns passed -- this does not raise an error,
# but the output is written as a single unpartitioned dataset
df.write.mode('overwrite').partitionBy().parquet('/tmp/wrong_partition')

# Right: specify the column(s) to partition by
output_path = '/tmp/right_partition'
df.write.mode('overwrite').partitionBy('category').parquet(output_path)
```

The first write produces plain part files directly under /tmp/wrong_partition, while the second creates category=A and category=B subfolders.
Quick Reference
| Use Case | Syntax | Description |
|---|---|---|
| Writing partitioned data | df.write.partitionBy('col').format('parquet').save(path) | Saves data split into folders by unique values of 'col' |
| Window function partitioning | Window.partitionBy('col').orderBy('col2') | Groups data by 'col' before applying window functions |
| Multiple columns | partitionBy('col1', 'col2') | Partitions data by combinations of multiple columns |
| Overwrite mode | df.write.mode('overwrite').partitionBy(...) | Avoids errors when writing to existing paths |
Key Takeaways
- Use partitionBy to organize data by column values during write or window operations.
- Always pass at least one column name to partitionBy; with no columns, the output is simply written unpartitioned.
- partitionBy in writing splits output into folders; it does not reorder DataFrame rows.
- Combine partitionBy with mode('overwrite') to safely overwrite existing data.
- In window functions, partitionBy groups data before applying calculations like row_number.