Apache Spark · data · ~30 mins

Partition tuning (repartition vs coalesce) in Apache Spark - Hands-On Comparison

Partition tuning with repartition vs coalesce in Apache Spark
📖 Scenario: You work with a large dataset of sales records. You want to optimize how Spark processes this data by changing the number of partitions. A well-chosen partition count lets Spark parallelize work efficiently instead of wasting resources on too many tiny tasks or too few large ones.
🎯 Goal: Learn how to change the number of partitions in a Spark DataFrame using repartition() and coalesce(). See how these methods affect the data partitions.
📋 What You'll Learn
Create a Spark DataFrame with sample sales data
Set a target number of partitions
Use repartition() to increase partitions
Use coalesce() to decrease partitions
Print the number of partitions after each operation
💡 Why This Matters
🌍 Real World
Data engineers and data scientists often tune partitions in Spark to improve job speed and resource use when processing big data.
💼 Career
Knowing how to repartition and coalesce data is important for optimizing Spark jobs in roles like data engineering, big data analytics, and machine learning pipelines.
1
Create a Spark DataFrame with sales data
Create a Spark DataFrame called sales_df with these exact rows: ("2023-01-01", "StoreA", 100), ("2023-01-02", "StoreB", 150), ("2023-01-03", "StoreA", 200), ("2023-01-04", "StoreC", 50). Use columns named date, store, and sales.
Need a hint?

Use spark.createDataFrame() with a list of tuples and a list of column names.

2
Set a target number of partitions
Create a variable called target_partitions and set it to 6. This will be the number of partitions we want to use.
Need a hint?

Just create a variable named target_partitions and assign the number 6.
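This step is a single assignment; keeping the count in a named variable makes it easy to tune later:

```python
# Desired number of partitions for the repartition step.
target_partitions = 6
```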

3
Use repartition() and coalesce() to change partitions
Create a new DataFrame called repartitioned_df by calling sales_df.repartition(target_partitions). Then create another DataFrame called coalesced_df by calling repartitioned_df.coalesce(2).
Need a hint?

Use repartition() to increase the partition count (it performs a full shuffle) and coalesce() to reduce it (it merges existing partitions without a shuffle).

4
Print the number of partitions after repartition and coalesce
Print the number of partitions in sales_df, repartitioned_df, and coalesced_df by using .rdd.getNumPartitions(). Use three separate print statements with the exact text: "Original partitions: ", "After repartition: ", and "After coalesce: " followed by the number.
Need a hint?

Use print() and .rdd.getNumPartitions() on each DataFrame.