Apache Spark · data · ~30 mins

Partition tuning (repartition vs coalesce) in Apache Spark - Hands-On Comparison

Partition tuning with repartition vs coalesce in Apache Spark
📖 Scenario: You work with a large dataset of sales records. You want to optimize how Spark processes this data by changing the number of partitions. A well-chosen partition count lets Spark parallelize work efficiently instead of wasting resources on too many tiny tasks or too few large ones.
🎯 Goal: Learn how to change the number of partitions in a Spark DataFrame using repartition() and coalesce(). See how these methods affect the data partitions.
📋 What You'll Learn
Create a Spark DataFrame with sample sales data
Set a target number of partitions
Use repartition() to increase partitions
Use coalesce() to decrease partitions
Print the number of partitions after each operation
💡 Why This Matters
🌍 Real World
Data engineers and data scientists often tune partitions in Spark to improve job speed and resource use when processing big data.
💼 Career
Knowing how to repartition and coalesce data is important for optimizing Spark jobs in roles like data engineering, big data analytics, and machine learning pipelines.
1
Create a Spark DataFrame with sales data
Create a Spark DataFrame called sales_df with these exact rows: ("2023-01-01", "StoreA", 100), ("2023-01-02", "StoreB", 150), ("2023-01-03", "StoreA", 200), ("2023-01-04", "StoreC", 50). Use columns named date, store, and sales.
Need a hint?

Use spark.createDataFrame() with a list of tuples and a list of column names.

2
Set a target number of partitions
Create a variable called target_partitions and set it to 6. This will be the number of partitions we want to use.
Need a hint?

Just create a variable named target_partitions and assign the number 6.
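This step is a single assignment; keeping the count in a named variable makes it easy to tune later:

```python
# Desired number of partitions for the repartition step.
target_partitions = 6
```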

3
Use repartition() and coalesce() to change partitions
Create a new DataFrame called repartitioned_df by calling sales_df.repartition(target_partitions). Then create another DataFrame called coalesced_df by calling repartitioned_df.coalesce(2).
Need a hint?

Use repartition() to increase the partition count (it performs a full shuffle) and coalesce() to reduce it (it merges existing partitions without a shuffle).

4
Print the number of partitions after repartition and coalesce
Print the number of partitions in sales_df, repartitioned_df, and coalesced_df by using .rdd.getNumPartitions(). Use three separate print statements with the exact text: "Original partitions: ", "After repartition: ", and "After coalesce: " followed by the number.
Need a hint?

Use print() and .rdd.getNumPartitions() on each DataFrame.