Challenge - 5 Problems
Partition Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
What is the number of partitions after this operation?
Given a Spark DataFrame with 4 partitions, what will be the number of partitions after applying df.repartition(6)?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
df = spark.createDataFrame(data, ['id', 'value'])
df = df.repartition(6)
num_partitions = df.rdd.getNumPartitions()
print(num_partitions)
💡 Hint
Repartition changes the number of partitions to the specified number.
✅ Explanation
The repartition method performs a full shuffle and creates exactly the number of partitions specified, here 6, regardless of the previous count.
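The behavior can be sketched in plain Python without a Spark cluster. The repartition helper below is a toy stand-in for Spark's real, distributed implementation: collect every row (a full shuffle), then redistribute round-robin into exactly n new partitions.

```python
# Toy sketch of repartition(n): full shuffle, then exactly n partitions.
# This is NOT Spark's implementation, just the conceptual behavior.
def repartition(partitions, n):
    rows = [row for part in partitions for row in part]  # full shuffle
    new_parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)  # round-robin placement
    return new_parts

df = [[(1, 'a')], [(2, 'b')], [(3, 'c')], [(4, 'd')]]  # 4 partitions
print(len(repartition(df, 6)))  # 6 (some partitions may be empty)
```

Because repartition sets the count exactly, going from 4 partitions to 6 is allowed; some of the resulting partitions simply hold fewer (or zero) rows.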
🧠 Conceptual
Intermediate · 1:30 remaining
Why is partitioning important in Spark?
Which of the following best explains why partitioning data is important in Apache Spark?
💡 Hint
Think about how Spark processes data across a cluster.
✅ Explanation
Partitioning splits the data into chunks that can be processed in parallel on different nodes of the cluster, improving speed and resource utilization.
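To make the parallelism concrete, here is a small pure-Python sketch, a toy stand-in for Spark's distributed execution: each partition is handed to its own worker, and the partial results are combined at the end. Spark schedules such per-partition tasks across cluster nodes; this sketch uses a local thread pool instead.

```python
# Toy illustration of why partitioning matters: one worker per partition.
from concurrent.futures import ThreadPoolExecutor

partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # a dataset in 3 partitions

with ThreadPoolExecutor() as pool:
    partial_sums = list(pool.map(sum, partitions))  # one task per partition

print(partial_sums, sum(partial_sums))  # [6, 15, 24] 45
```

More partitions allow more tasks to run at once, up to the number of available cores or executors; too few partitions leave workers idle.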
❓ Predict Output
Advanced · 2:00 remaining
What is the output of this partition count code?
Consider this Spark code snippet. What will be printed as the number of partitions?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(10)]
df = spark.createDataFrame(data, ['num'])
df2 = df.coalesce(2)
print(df2.rdd.getNumPartitions())
💡 Hint
Coalesce reduces the number of partitions without a full shuffle.
✅ Explanation
The coalesce method reduces the partition count to the specified number (here 2) without a full shuffle. Note that coalesce can only decrease the count: if the DataFrame starts with fewer than 2 partitions (the initial count depends on spark.default.parallelism), the count stays unchanged.
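The no-shuffle reduction can be sketched in plain Python. The coalesce helper below is a simplified stand-in: whole partitions are merged into at most n buckets without redistributing individual rows (real coalesce also considers data locality when grouping partitions, which this toy version ignores).

```python
# Toy sketch of coalesce(n): merge whole partitions down to at most n.
def coalesce(partitions, n):
    n = min(n, len(partitions))  # coalesce never increases the count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)  # whole partitions merged together
    return merged

parts = [[i] for i in range(10)]  # 10 single-row partitions
print(len(coalesce(parts, 2)))  # 2
```

Because rows are never moved individually, coalesce is cheaper than repartition, but it can produce unevenly sized partitions.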
🔧 Debug
Advanced · 2:00 remaining
Identify the error in this partitioning code
What error will this Spark code raise?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(1, 'x'), (2, 'y')]
df = spark.createDataFrame(data, ['id', 'val'])
df2 = df.repartition(-3)
print(df2.rdd.getNumPartitions())
💡 Hint
The number of partitions must be positive.
✅ Explanation
Repartition requires a positive number of partitions; Spark rejects -3 with an IllegalArgumentException ("Number of partitions must be positive"), which PySpark surfaces as pyspark.sql.utils.IllegalArgumentException rather than a Python ValueError.
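The guard itself can be mirrored in a few lines of plain Python. The check_num_partitions name here is hypothetical: Spark enforces the rule with a Scala require() and PySpark surfaces an IllegalArgumentException, while this sketch raises ValueError purely for demonstration.

```python
# Plain-Python mirror of the guard Spark applies before repartitioning.
# Hypothetical helper; raises ValueError here instead of Spark's
# IllegalArgumentException.
def check_num_partitions(n):
    if not isinstance(n, int) or n <= 0:
        raise ValueError(f"Number of partitions must be positive, got {n}")

try:
    check_num_partitions(-3)
except ValueError as e:
    print(e)  # Number of partitions must be positive, got -3
```

The check fails before any data is touched, so no shuffle work is wasted on an invalid partition count.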
🚀 Application
Expert · 2:30 remaining
How many partitions after chained operations?
Given a DataFrame with 8 partitions, what is the number of partitions after these chained operations?
df2 = df.repartition(4).coalesce(2).repartition(5)
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = [(i,) for i in range(20)]
df = spark.createDataFrame(data, ['val']).repartition(8)
df2 = df.repartition(4).coalesce(2).repartition(5)
print(df2.rdd.getNumPartitions())
💡 Hint
Repartition sets the partition count exactly; coalesce reduces it without a shuffle.
✅ Explanation
The final repartition(5) triggers a full shuffle and sets the partition count to exactly 5, overriding the earlier repartition(4) and coalesce(2) steps.
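The chain can be traced with a tiny illustrative helper (not a Spark API) that applies the simplified semantics: repartition(n) sets the count to exactly n, while coalesce(n) can only lower it.

```python
# Hypothetical tracer for partition counts through chained operations.
def trace(start, ops):
    count = start
    for op, n in ops:
        # repartition sets the count exactly; coalesce can only reduce it
        count = n if op == 'repartition' else min(count, n)
    return count

# 8 -> repartition(4) -> 4 -> coalesce(2) -> 2 -> repartition(5) -> 5
print(trace(8, [('repartition', 4), ('coalesce', 2), ('repartition', 5)]))  # 5
```

Only the last repartition matters for the final count, but each intermediate step still costs real work (the repartition steps each shuffle the data), so such chains are usually worth simplifying.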