Challenge - 5 Problems
Partition Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of repartition vs coalesce on DataFrame partitions
Consider a Spark DataFrame with 10 partitions. After running the following code, how many partitions do df_repart and df_coalesce have, respectively?

Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(100)]
df = spark.createDataFrame(data, ['num']).repartition(10)
print(df.rdd.getNumPartitions())

df_repart = df.repartition(5)
df_coalesce = df.coalesce(5)
print(df_repart.rdd.getNumPartitions())
print(df_coalesce.rdd.getNumPartitions())
💡 Hint
Both repartition and coalesce can change the number of partitions, but repartition always reshuffles data.
✅ Explanation
Both repartition(5) and coalesce(5) reduce the number of partitions to 5. Repartition reshuffles data fully, while coalesce tries to avoid shuffle but still reduces partitions.
🧠 Conceptual
Intermediate · 1:30 remaining
Difference in shuffle behavior between repartition and coalesce
Which statement correctly describes the shuffle behavior of repartition and coalesce in Spark?
💡 Hint
Think about data movement when changing partitions.
✅ Explanation
Repartition always triggers a full shuffle to evenly distribute data. Coalesce tries to avoid shuffle when reducing partitions by merging existing partitions.
❓ Data Output
Advanced · 2:30 remaining
Resulting partition sizes after coalesce without shuffle
Given a DataFrame with 8 partitions, each containing 10 rows, approximately how many rows will each partition contain after applying df.coalesce(4) (no shuffle)?

Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(80)]
df = spark.createDataFrame(data, ['num']).repartition(8)
print(df.rdd.glom().map(len).collect())

df_coalesce = df.coalesce(4)
print(df_coalesce.rdd.glom().map(len).collect())
💡 Hint
Coalesce merges partitions without reshuffling data.
✅ Explanation
Coalesce merges adjacent partitions, so 8 partitions with 10 rows each become 4 partitions with about 20 rows each.
🔧 Debug
Advanced · 1:30 remaining
Why does coalesce not increase partitions?
You run df.coalesce(20) on a DataFrame with 10 partitions. What will happen, and why?
💡 Hint
Check the behavior of coalesce when asked to increase partitions.
✅ Explanation
Coalesce only reduces partitions; it does not increase them. To increase partitions, repartition must be used.
🚀 Application
Expert · 3:00 remaining
Choosing repartition vs coalesce for performance optimization
You have a large Spark DataFrame with 100 partitions. You want to reduce partitions to 10 for faster downstream processing. You also want to minimize data shuffle to save time. Which approach is best?
💡 Hint
Consider shuffle cost and partition reduction.
✅ Explanation
Coalesce reduces partitions without a shuffle, which is faster but may leave partitions unevenly sized; repartition always performs a full shuffle. Since the goal here is to reduce 100 partitions to 10 while minimizing shuffle, df.coalesce(10) is the better choice.