Challenge - 5 Problems
Partition Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
What is the number of partitions after this operation?
Given a Spark DataFrame with 4 partitions, what will be the number of partitions after applying df.repartition(6)?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
df = spark.createDataFrame(data, ['id', 'value'])
df = df.repartition(6)
num_partitions = df.rdd.getNumPartitions()
print(num_partitions)
💡 Hint
Repartition changes the number of partitions to the specified number.
✅ Explanation
The repartition method performs a full shuffle and creates exactly the number of partitions specified, here 6, regardless of the previous count.
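The behavior can be sketched in plain Python without a Spark cluster. The repartition helper below is a toy stand-in for Spark's real, distributed implementation: collect every row (a full shuffle), then redistribute round-robin into exactly n new partitions.

```python
# Toy sketch of repartition(n): full shuffle, then exactly n partitions.
# This is NOT Spark's implementation, just the conceptual behavior.
def repartition(partitions, n):
    rows = [row for part in partitions for row in part]  # full shuffle
    new_parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)  # round-robin placement
    return new_parts

df = [[(1, 'a')], [(2, 'b')], [(3, 'c')], [(4, 'd')]]  # 4 partitions
print(len(repartition(df, 6)))  # 6 (some partitions may be empty)
```

Because repartition sets the count exactly, going from 4 partitions to 6 is allowed; some of the resulting partitions simply hold fewer (or zero) rows.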
🧠 Conceptual
Intermediate · 1:30 remaining
Why is partitioning important in Spark?
Which of the following best explains why partitioning data is important in Apache Spark?
💡 Hint
Think about how Spark processes data across a cluster.
✅ Explanation
Partitioning splits the data into chunks that can be processed in parallel on different nodes of the cluster, improving speed and resource utilization.
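To make the parallelism concrete, here is a small pure-Python sketch, a toy stand-in for Spark's distributed execution: each partition is handed to its own worker, and the partial results are combined at the end. Spark schedules such per-partition tasks across cluster nodes; this sketch uses a local thread pool instead.

```python
# Toy illustration of why partitioning matters: one worker per partition.
from concurrent.futures import ThreadPoolExecutor

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # a dataset in 3 partitions

with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(sum, partitions))  # one task per partition

print(partial_sums, sum(partial_sums))  # [6, 15, 24] 45
```

More partitions allow more tasks to run at once, up to the number of available cores or executors; too few partitions leave workers idle.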
❓ Predict Output
Advanced · 2:00 remaining
What is the output of this partition count code?
Consider this Spark code snippet. What will be printed as the number of partitions?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(10)]
df = spark.createDataFrame(data, ['num'])
df2 = df.coalesce(2)
print(df2.rdd.getNumPartitions())
💡 Hint
Coalesce reduces the number of partitions without a full shuffle.
✅ Explanation
The coalesce method reduces the partition count to the specified number (here 2) without a full shuffle. Note that coalesce can only decrease the count: if the DataFrame starts with fewer than 2 partitions (the initial count depends on spark.default.parallelism), the count stays unchanged.
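The no-shuffle reduction can be sketched in plain Python. The coalesce helper below is a simplified stand-in: whole partitions are merged into at most n buckets without redistributing individual rows (real coalesce also considers data locality when grouping partitions, which this toy version ignores).

```python
# Toy sketch of coalesce(n): merge whole partitions down to at most n.
def coalesce(partitions, n):
    n = min(n, len(partitions))  # coalesce never increases the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # whole partitions merged together
    return merged

parts = [[i] for i in range(10)]  # 10 single-row partitions
print(len(coalesce(parts, 2)))  # 2
```

Because rows are never moved individually, coalesce is cheaper than repartition, but it can produce unevenly sized partitions.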
🔧 Debug
Advanced · 2:00 remaining
Identify the error in this partitioning code
What error will this Spark code raise?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'x'), (2, 'y')]
df = spark.createDataFrame(data, ['id', 'val'])
df2 = df.repartition(-3)
print(df2.rdd.getNumPartitions())
💡 Hint
The number of partitions must be positive.
✅ Explanation
Repartition requires a positive number of partitions; Spark rejects -3 with an IllegalArgumentException ("Number of partitions must be positive"), which PySpark surfaces as pyspark.sql.utils.IllegalArgumentException rather than a Python ValueError.
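The guard itself can be mirrored in a few lines of plain Python. The check_num_partitions name here is hypothetical: Spark enforces the rule with a Scala require() and PySpark surfaces an IllegalArgumentException, while this sketch raises ValueError purely for demonstration.

```python
# Plain-Python mirror of the guard Spark applies before repartitioning.
# Hypothetical helper; raises ValueError here instead of Spark's
# IllegalArgumentException.
def check_num_partitions(n):
    if not isinstance(n, int) or n <= 0:
        raise ValueError(f"Number of partitions must be positive, got {n}")

try:
    check_num_partitions(-3)
except ValueError as e:
    print(e)  # Number of partitions must be positive, got -3
```

The check fails before any data is touched, so no shuffle work is wasted on an invalid partition count.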
🚀 Application
Expert · 2:30 remaining
How many partitions after chained operations?
Given a DataFrame with 8 partitions, what is the number of partitions after these chained operations?
df2 = df.repartition(4).coalesce(2).repartition(5)
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(20)]
df = spark.createDataFrame(data, ['val']).repartition(8)
df2 = df.repartition(4).coalesce(2).repartition(5)
print(df2.rdd.getNumPartitions())
💡 Hint
Repartition sets the partition count exactly; coalesce reduces it without a shuffle.
✅ Explanation
The final repartition(5) triggers a full shuffle and sets the partition count to exactly 5, overriding the earlier repartition(4) and coalesce(2) steps.
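The chain can be traced with a tiny illustrative helper (not a Spark API) that applies the simplified semantics: repartition(n) sets the count to exactly n, while coalesce(n) can only lower it.

```python
# Hypothetical tracer for partition counts through chained operations.
def trace(start, ops):
    count = start
    for op, n in ops:
        # repartition sets the count exactly; coalesce can only reduce it
        count = n if op == 'repartition' else min(count, n)
    return count

# 8 -> repartition(4) -> 4 -> coalesce(2) -> 2 -> repartition(5) -> 5
print(trace(8, [('repartition', 4), ('coalesce', 2), ('repartition', 5)]))  # 5
```

Only the last repartition matters for the final count, but each intermediate step still costs real work (the repartition steps each shuffle the data), so such chains are usually worth simplifying.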