
Writing output with partitioning in Apache Spark

Introduction

Partitioning organizes data as it is saved: Spark splits the output into folders by the values of one or more columns, making specific subsets easier to find and read later.

Partitioning is most useful:

When saving large datasets, to speed up reading specific parts.
When you want to group data by categories such as date or region.
When preparing data for tools that read partitioned folders efficiently.
When you want to reduce the amount of data scanned during queries.
Syntax
Apache Spark
dataframe.write.partitionBy("column_name").format("file_format").save("path")
You can partition by one or more columns by passing multiple column names.
Partition columns create folders named like column=value inside the save path.
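The column=value folder layout can be sketched without Spark. The following is a minimal, stdlib-only simulation of how a row maps to its partition directory; partition_path is a hypothetical helper for illustration, not a Spark API, and real Spark part-file names inside each folder differ.

```python
from pathlib import PurePosixPath

def partition_path(base, partition_cols, row):
    # Build the Hive-style directory Spark uses: one column=value folder per level.
    parts = [f"{col}={row[col]}" for col in partition_cols]
    return str(PurePosixPath(base, *parts))

row = {"year": 2023, "country": "US", "name": "Alice", "sales": 100}
print(partition_path("/data/output", ["year", "country"], row))
# → /data/output/year=2023/country=US
```

Note that the partition columns come first in the order you pass them, which determines the folder nesting.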
Examples
Saves the DataFrame as Parquet files, partitioned by the 'year' column.
Apache Spark
df.write.partitionBy("year").parquet("/data/output")
Saves the DataFrame as CSV files, partitioned by 'country' and then 'month' (nested folders: country at the top level, month inside).
Apache Spark
df.write.partitionBy("country", "month").csv("/data/output_csv")
Sample Program

This code creates a small DataFrame with sales data, saves it as Parquet files partitioned by 'year' and 'country', then reads the data back and shows it.

Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionExample").getOrCreate()

# Create sample data
data = [
    (2023, "US", "Alice", 100),
    (2023, "US", "Bob", 200),
    (2023, "CA", "Charlie", 300),
    (2024, "US", "David", 400),
    (2024, "CA", "Eve", 500)
]

columns = ["year", "country", "name", "sales"]
df = spark.createDataFrame(data, columns)

# Write output partitioned by year and country
output_path = "/tmp/partitioned_output"
df.write.mode("overwrite").partitionBy("year", "country").parquet(output_path)

# Read back the data to verify
read_df = spark.read.parquet(output_path)
read_df.show()

spark.stop()
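After the write above, the output directory contains nested folders such as year=2023/country=US/. When Spark reads the path back, it recovers the 'year' and 'country' columns from those folder names (partition discovery). A stdlib-only sketch of that recovery step follows; parse_partition_values is a hypothetical helper for illustration, not a Spark API.

```python
def parse_partition_values(relative_path):
    # Recover partition column values from a Hive-style path,
    # mimicking what Spark's partition discovery does on read.
    values = {}
    for segment in relative_path.split("/"):
        if "=" in segment:
            col, val = segment.split("=", 1)
            values[col] = val
    return values

print(parse_partition_values("year=2024/country=CA/part-00000.parquet"))
# → {'year': '2024', 'country': 'CA'}
```

This is why the partition columns still appear in read_df even though they are not stored inside the data files themselves.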
Important Notes

Partitioning creates one folder per distinct value combination on disk, so avoid partitioning on high-cardinality columns (such as IDs or timestamps) to prevent an explosion of small files.

Partition on columns you frequently filter by; Spark can then skip entire folders (partition pruning) and scan far less data.
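One way to gauge the folder-count cost before writing is to count distinct partition-column combinations, since each combination becomes at least one directory (and typically at least one file per writing task). A quick stdlib sketch using the (year, country) pairs from the sample program above:

```python
# (year, country) pairs from the sample sales data.
rows = [
    (2023, "US"), (2023, "US"), (2023, "CA"),
    (2024, "US"), (2024, "CA"),
]

# Each distinct combination becomes its own year=.../country=... directory,
# so high-cardinality columns multiply the number of folders and small files.
distinct_partitions = set(rows)
print(len(distinct_partitions))  # → 4
```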

Summary

Partitioning splits saved data into folders by column values.

It helps organize and speed up reading large datasets.

Use .write.partitionBy() with one or more columns before saving.