Challenge - 5 Problems
Partition Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of repartition vs coalesce on DataFrame partitions
Consider a Spark DataFrame with 10 partitions. After running the following code, how many partitions do df_repart and df_coalesce have, respectively?

Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(100)]
df = spark.createDataFrame(data, ['num']).repartition(10)
print(df.rdd.getNumPartitions())

df_repart = df.repartition(5)
df_coalesce = df.coalesce(5)
print(df_repart.rdd.getNumPartitions())
print(df_coalesce.rdd.getNumPartitions())
💡 Hint
Both repartition and coalesce can change the number of partitions, but repartition always reshuffles data.
✅ Explanation
Both repartition(5) and coalesce(5) reduce the number of partitions to 5. Repartition reshuffles data fully, while coalesce tries to avoid shuffle but still reduces partitions.
🧠 Conceptual
Intermediate · 1:30 remaining
Difference in shuffle behavior between repartition and coalesce
Which statement correctly describes the shuffle behavior of repartition and coalesce in Spark?
💡 Hint
Think about data movement when changing partitions.
✅ Explanation
Repartition always triggers a full shuffle to evenly distribute data. Coalesce tries to avoid shuffle when reducing partitions by merging existing partitions.
❓ Data Output
Advanced · 2:30 remaining
Resulting partition sizes after coalesce without shuffle
Given a DataFrame with 8 partitions, each containing 10 rows, approximately how many rows will each partition contain after applying df.coalesce(4) (no shuffle)?

Apache Spark

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(80)]
df = spark.createDataFrame(data, ['num']).repartition(8)
print(df.rdd.glom().map(len).collect())

df_coalesce = df.coalesce(4)
print(df_coalesce.rdd.glom().map(len).collect())
💡 Hint
Coalesce merges partitions without reshuffling data.
✅ Explanation
Coalesce merges adjacent partitions, so 8 partitions with 10 rows each become 4 partitions with about 20 rows each.
🔧 Debug
Advanced · 1:30 remaining
Why does coalesce not increase partitions?
You run df.coalesce(20) on a DataFrame with 10 partitions. What will happen, and why?
💡 Hint
Check the behavior of coalesce when asked to increase partitions.
✅ Explanation
Coalesce only reduces partitions; it does not increase them. To increase partitions, repartition must be used.
🚀 Application
Expert · 3:00 remaining
Choosing repartition vs coalesce for performance optimization
You have a large Spark DataFrame with 100 partitions. You want to reduce partitions to 10 for faster downstream processing. You also want to minimize data shuffle to save time. Which approach is best?
💡 Hint
Consider shuffle cost and partition reduction.
✅ Explanation
Coalesce reduces partitions without a shuffle, which is faster but may leave partitions unevenly sized; repartition always performs a full shuffle. Since the goal here is to reduce 100 partitions to 10 while minimizing shuffle, df.coalesce(10) is the better choice.