
Writing output with partitioning in Apache Spark

Introduction

Partitioning organizes data as it is saved: Spark splits the output into folders by the values of one or more columns, making specific subsets easier to find and read later.

Partitioning is most useful:

When saving large datasets, to speed up reading specific parts.
When you want to group data by categories such as date or region.
When preparing data for tools that read partitioned folders efficiently.
When you want to reduce the amount of data scanned during queries.
Syntax
Apache Spark
dataframe.write.partitionBy("column_name").format("file_format").save("path")
You can partition by one or more columns by passing multiple column names.
Partition columns create folders named like column=value inside the save path.
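The column=value folder layout can be sketched without Spark. The following is a minimal, stdlib-only simulation of how a row maps to its partition directory; partition_path is a hypothetical helper for illustration, not a Spark API, and real Spark part-file names inside each folder differ.

```python
from pathlib import PurePosixPath

def partition_path(base, partition_cols, row):
    # Build the Hive-style directory Spark uses: one column=value folder per level.
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return str(PurePosixPath(base, *parts))

row = {"year": 2023, "country": "US", "name": "Alice", "sales": 100}
print(partition_path("/data/output", ["year", "country"], row))
# → /data/output/year=2023/country=US
```

Note that the partition columns come first in the order you pass them, which determines the folder nesting.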
Examples
Saves the DataFrame as Parquet files, partitioned by the 'year' column.
Apache Spark
df.write.partitionBy("year").parquet("/data/output")
Saves the DataFrame as CSV files, partitioned by 'country' and then 'month' (nested folders: country at the top level, month inside).
Apache Spark
df.write.partitionBy("country", "month").csv("/data/output_csv")
Sample Program

This code creates a small DataFrame with sales data, saves it as Parquet files partitioned by 'year' and 'country', then reads the data back and shows it.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionExample").getOrCreate()

# Create sample data
data = [
    (2023, "US", "Alice", 100),
    (2023, "US", "Bob", 200),
    (2023, "CA", "Charlie", 300),
    (2024, "US", "David", 400),
    (2024, "CA", "Eve", 500)
]

columns = ["year", "country", "name", "sales"]
df = spark.createDataFrame(data, columns)

# Write output partitioned by year and country
output_path = "/tmp/partitioned_output"
df.write.mode("overwrite").partitionBy("year", "country").parquet(output_path)

# Read back the data to verify
read_df = spark.read.parquet(output_path)
read_df.show()

spark.stop()
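After the write above, the output directory contains nested folders such as year=2023/country=US/. When Spark reads the path back, it recovers the 'year' and 'country' columns from those folder names (partition discovery). A stdlib-only sketch of that recovery step follows; parse_partition_values is a hypothetical helper for illustration, not a Spark API.

```python
def parse_partition_values(relative_path):
    # Recover partition column values from a Hive-style path,
    # mimicking what Spark's partition discovery does on read.
    values = {}
    for segment in relative_path.split("/"):
        if "=" in segment:
            col, val = segment.split("=", 1)
            values[col] = val
    return values

print(parse_partition_values("year=2024/country=CA/part-00000.parquet"))
# → {'year': '2024', 'country': 'CA'}
```

This is why the partition columns still appear in read_df even though they are not stored inside the data files themselves.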
Important Notes

Partitioning creates one folder per distinct value combination on disk, so avoid partitioning on high-cardinality columns (such as IDs or timestamps) to prevent an explosion of small files.

Partition on columns you frequently filter by; Spark can then skip entire folders (partition pruning) and scan far less data.
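One way to gauge the folder-count cost before writing is to count distinct partition-column combinations, since each combination becomes at least one directory (and typically at least one file per writing task). A quick stdlib sketch using the (year, country) pairs from the sample program above:

```python
# (year, country) pairs from the sample sales data.
rows = [
    (2023, "US"), (2023, "US"), (2023, "CA"),
    (2024, "US"), (2024, "CA"),
]

# Each distinct combination becomes its own year=.../country=... directory,
# so high-cardinality columns multiply the number of folders and small files.
distinct_partitions = set(rows)
print(len(distinct_partitions))  # → 4
```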

Summary

Partitioning splits saved data into folders by column values.

It helps organize and speed up reading large datasets.

Use .write.partitionBy() with one or more columns before saving.