Challenge - 5 Problems
Broadcast Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Intermediate · 2:00 remaining
Output of broadcast variable usage in Spark
What is the output of this Spark code snippet using a broadcast variable?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('BroadcastTest').getOrCreate()
sc = spark.sparkContext
broadcastVar = sc.broadcast([1, 2, 3])
rdd = sc.parallelize([0, 1, 2, 3])
result = rdd.map(lambda x: broadcastVar.value[x] if x < len(broadcastVar.value) else -1).collect()
print(result)
💡 Hint
Remember that broadcast variables share data efficiently and you can access their value attribute.
Explanation
The broadcast variable holds the list [1, 2, 3]. The RDD elements 0, 1, and 2 are used as indices into this list; index 3 is out of range, so the lambda returns -1. The printed result is [1, 2, 3, -1].
🧠 Conceptual
Intermediate · 1:30 remaining
Purpose of broadcast variables in Spark
What is the main purpose of using broadcast variables in Apache Spark?
💡 Hint
Think about how to avoid sending the same data multiple times to workers.
Explanation
Broadcast variables let Spark ship a large read-only value to every worker node once, rather than re-sending it with each task, which reduces communication and serialization overhead.
🔧 Debug
Advanced · 2:00 remaining
Identify the error with broadcast variable usage
What error will this Spark code raise?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('BroadcastError').getOrCreate()
sc = spark.sparkContext
broadcastVar = sc.broadcast({'a': 1, 'b': 2})
rdd = sc.parallelize(['a', 'b', 'c'])
result = rdd.map(lambda x: broadcastVar.value[x]).collect()
print(result)
💡 Hint
Check what happens if a key is missing in a dictionary lookup.
Explanation
The key 'c' is not in the broadcast dictionary, so broadcastVar.value['c'] raises a KeyError on a worker, and the failure surfaces when collect() is called.
❓ Predict Output
Advanced · 2:00 remaining
Result of modifying broadcast variable after creation
What is the output of this Spark code?
Apache Spark
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').appName('BroadcastModify').getOrCreate()
sc = spark.sparkContext
broadcastVar = sc.broadcast([10, 20, 30])
broadcastVar.value.append(40)
rdd = sc.parallelize([0, 1, 2, 3])
result = rdd.map(lambda x: broadcastVar.value[x] if x < len(broadcastVar.value) else -1).collect()
print(result)
💡 Hint
Modifications to broadcast variables on the driver after creation do not propagate to workers.
Explanation
append(40) modifies only the driver's local copy of the list. Workers receive the value as it was serialized at broadcast time ([10, 20, 30]), so on the workers the length is 3 and x = 3 maps to -1. The printed result is [10, 20, 30, -1].
🚀 Application
Expert · 2:30 remaining
Best use case for broadcast variables in a join operation
You have a large dataset A and a small dataset B. You want to join them in Spark efficiently. Which approach best uses broadcast variables?
💡 Hint
Broadcasting the smaller dataset reduces shuffling and speeds up the join.
Explanation
Broadcasting the small dataset B ships it once to every worker, so each partition of the large dataset A can be joined locally without shuffling A across the network.