
Partition tuning (repartition vs coalesce) in Apache Spark - Practice Questions

Challenge - 5 Problems
Predict Output (intermediate)
Output of repartition vs coalesce on DataFrame partitions
Consider a Spark DataFrame with 10 partitions. What will be the number of partitions after applying the following code?

df_repart = df.repartition(5)
df_coalesce = df.coalesce(5)

How many partitions do df_repart and df_coalesce have, respectively?
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(100)]
df = spark.createDataFrame(data, ['num']).repartition(10)

print(df.rdd.getNumPartitions())
df_repart = df.repartition(5)
df_coalesce = df.coalesce(5)
print(df_repart.rdd.getNumPartitions())
print(df_coalesce.rdd.getNumPartitions())
A. df_repart has 5 partitions, df_coalesce has 10 partitions
B. df_repart has 10 partitions, df_coalesce has 5 partitions
C. df_repart has 10 partitions, df_coalesce has 10 partitions
D. df_repart has 5 partitions, df_coalesce has 5 partitions
💡 Hint
Both repartition and coalesce can change the number of partitions, but repartition always reshuffles data.
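The hint's rule can be sketched without a cluster. The following is a plain-Python toy model of how each call decides the final partition count, not Spark's actual implementation:

```python
# Toy model of the partition-count rules (illustration only, not
# Spark's real implementation).

def repartition_count(current, requested):
    # repartition does a full shuffle and always yields exactly the
    # requested number of partitions, whether up or down.
    return requested

def coalesce_count(current, requested):
    # coalesce only merges existing partitions, so the result can
    # never exceed the current count.
    return min(current, requested)

print(repartition_count(10, 5))  # -> 5
print(coalesce_count(10, 5))     # -> 5
```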
🧠 Conceptual (intermediate)
Difference in shuffle behavior between repartition and coalesce
Which statement correctly describes the shuffle behavior of repartition and coalesce in Spark?
A. Both repartition and coalesce always cause a shuffle regardless of partition count.
B. Coalesce always causes a shuffle; repartition avoids a shuffle when increasing partitions.
C. Repartition always causes a shuffle; coalesce avoids a shuffle when decreasing partitions.
D. Neither repartition nor coalesce ever causes a shuffle.
💡 Hint
Think about data movement when changing partitions.
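The data-movement difference can be modeled in plain Python. In this toy sketch (an illustration, not Spark's implementation; the modulo grouping stands in for Spark's locality-aware coalescer), coalesce combines whole input partitions so no individual row has to move between groups, while repartition reassigns every row:

```python
# Toy model: partitions are lists of rows. "Shuffle" means a row can
# end up in any output partition, not just alongside its old neighbors.

parts = [[i * 10 + j for j in range(10)] for i in range(4)]  # 4 partitions, 10 rows each

def coalesce_no_shuffle(partitions, n):
    # Merge whole input partitions into n groups: rows travel only as
    # intact partitions, so no network-wide shuffle is required.
    groups = [[] for _ in range(n)]
    for idx, part in enumerate(partitions):
        groups[idx % n].extend(part)
    return groups

def repartition_shuffle(partitions, n):
    # Redistribute every row individually (round-robin here): each row
    # may land in any output partition, which is why repartition
    # always incurs a full shuffle.
    groups = [[] for _ in range(n)]
    for i, row in enumerate(r for p in partitions for r in p):
        groups[i % n].append(row)
    return groups

print([len(g) for g in coalesce_no_shuffle(parts, 2)])   # -> [20, 20]
print([len(g) for g in repartition_shuffle(parts, 2)])   # -> [20, 20]
```

Both end with the same sizes here, but in the coalesce case each output group is simply a concatenation of original partitions, whereas repartition has touched every row.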
Data Output (advanced)
Resulting partition sizes after coalesce without shuffle
Given a DataFrame with 8 partitions each containing 10 rows, what will be the approximate number of rows in each partition after applying df.coalesce(4) without shuffle?
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(80)]
df = spark.createDataFrame(data, ['num']).repartition(8)

print(df.rdd.glom().map(len).collect())
df_coalesce = df.coalesce(4)
print(df_coalesce.rdd.glom().map(len).collect())
A. [20, 20, 20, 20]
B. [10, 10, 10, 10, 10, 10, 10, 10]
C. [40, 10, 10, 10]
D. [15, 15, 15, 15]
💡 Hint
Coalesce merges partitions without reshuffling data.
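The merge in the hint can be sketched in plain Python. This is a toy model, not Spark's real coalescer (which groups by data locality), but with evenly sized partitions the outcome is approximately the same even grouping:

```python
# Toy sketch: 8 partitions of 10 rows each, merged down to 4 by
# absorbing whole adjacent input partitions (no rows are reshuffled).
partitions = [list(range(p * 10, p * 10 + 10)) for p in range(8)]

def coalesce_sizes(partitions, n):
    groups = [[] for _ in range(n)]
    for idx, part in enumerate(partitions):
        # Each output partition absorbs complete input partitions.
        groups[idx * n // len(partitions)].extend(part)
    return [len(g) for g in groups]

print(coalesce_sizes(partitions, 4))  # -> [20, 20, 20, 20]
```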
🔧 Debug (advanced)
Why does coalesce not increase partitions?
You run df.coalesce(20) on a DataFrame with 10 partitions. What will happen and why?
A. The DataFrame will still have 10 partitions because coalesce cannot increase partitions.
B. The DataFrame will have 5 partitions because coalesce halves the partitions automatically.
C. The code will raise an error because coalesce cannot accept a number greater than the current partition count.
D. The DataFrame will have 20 partitions because coalesce can increase partitions.
💡 Hint
Check the behavior of coalesce when asked to increase partitions.
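As a quick illustration (a toy model, not Spark's implementation): because coalesce can only merge existing partitions, a request above the current count is clamped; growing the partition count requires repartition's shuffle.

```python
# coalesce clamps the request to the current partition count; it
# never splits partitions, so it cannot add any.
def coalesce_count(requested, current=10):
    return min(current, requested)

print(coalesce_count(20))  # -> 10: request above current count is a no-op
print(coalesce_count(4))   # -> 4: reducing works as expected
```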
🚀 Application (expert)
Choosing repartition vs coalesce for performance optimization
You have a large Spark DataFrame with 100 partitions. You want to reduce partitions to 10 for faster downstream processing. You also want to minimize data shuffle to save time. Which approach is best?
A. Use df.repartition(10) to evenly distribute data with a shuffle.
B. Use df.coalesce(10) to reduce partitions without a shuffle.
C. Use df.coalesce(10, shuffle=True) to reduce partitions with a shuffle.
D. Use df.repartition(10, shuffle=False) to reduce partitions without a shuffle.
💡 Hint
Consider shuffle cost and partition reduction.