
Broadcast variables in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Problem 1: Predict Output (intermediate)
Output of broadcast variable usage in Spark
What is the output of this Spark code snippet using a broadcast variable?
Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('BroadcastTest').getOrCreate()
sc = spark.sparkContext

broadcastVar = sc.broadcast([1, 2, 3])
rdd = sc.parallelize([0, 1, 2, 3])
result = rdd.map(lambda x: broadcastVar.value[x] if x < len(broadcastVar.value) else -1).collect()
print(result)
A) [0, 1, 2, 3]
B) [1, 2, 3, -1]
C) [1, 2, 3, 3]
D) IndexError
💡 Hint
Remember that broadcast variables share data efficiently and you can access their value attribute.
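The bounds-checked lookup pattern in the snippet can be tried in plain Python, with placeholder data here so the challenge answer isn't given away:

```python
# Same pattern as the map() lambda above, with stand-in data.
lookup = ['x', 'y']   # stands in for broadcastVar.value
indices = [0, 1, 2]   # stands in for the RDD elements

# Index into the list when in range, otherwise fall back to -1.
result = [lookup[i] if i < len(lookup) else -1 for i in indices]
print(result)  # ['x', 'y', -1]
```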
Problem 2: 🧠 Conceptual (intermediate)
Purpose of broadcast variables in Spark
What is the main purpose of using broadcast variables in Apache Spark?
A) To efficiently share a read-only variable across all worker nodes without sending it with every task
B) To cache RDDs in memory for faster access
C) To shuffle data between partitions during a join operation
D) To store intermediate results on disk to avoid recomputation
💡 Hint
Think about how to avoid sending the same data multiple times to workers.
Problem 3: 🔧 Debug (advanced)
Identify the error with broadcast variable usage
What error will this Spark code raise?
Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('BroadcastError').getOrCreate()
sc = spark.sparkContext

broadcastVar = sc.broadcast({'a': 1, 'b': 2})
rdd = sc.parallelize(['a', 'b', 'c'])
result = rdd.map(lambda x: broadcastVar.value[x]).collect()
print(result)
A) KeyError
B) TypeError
C) ValueError
D) No error, output: [1, 2, None]
💡 Hint
Check what happens if a key is missing in a dictionary lookup.
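The hint can be verified without Spark at all: the lambda inside the map() is ordinary Python, so a dictionary lookup behaves exactly as it would locally.

```python
d = {'a': 1, 'b': 2}

# Indexing with a missing key raises KeyError...
try:
    d['c']
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# ...while .get() returns a default instead of raising.
print(d.get('c', -1))  # -1
```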
Problem 4: Predict Output (advanced)
Result of modifying broadcast variable after creation
What is the output of this Spark code?
Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('BroadcastModify').getOrCreate()
sc = spark.sparkContext

broadcastVar = sc.broadcast([10, 20, 30])
broadcastVar.value.append(40)
rdd = sc.parallelize([0, 1, 2, 3])
result = rdd.map(lambda x: broadcastVar.value[x] if x < len(broadcastVar.value) else -1).collect()
print(result)
A) [10, 20, 30, 40]
B) [10, 20, 30]
C) AttributeError
D) [10, 20, 30, -1]
💡 Hint
Modifications to broadcast variables on the driver after creation do not propagate to workers.
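The reason mutations don't propagate is that the broadcast value is serialized when sc.broadcast() is called; workers later deserialize that snapshot. Python's pickle module gives a rough stand-in for the mechanism (a sketch, not PySpark's actual internals):

```python
import pickle

data = [10, 20, 30]
snapshot = pickle.dumps(data)   # roughly what sc.broadcast() captures up front
data.append(40)                 # later driver-side mutation

worker_copy = pickle.loads(snapshot)  # what a worker would deserialize
print(worker_copy)  # [10, 20, 30] -- the append never reached the snapshot
```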
Problem 5: 🚀 Application (expert)
Best use case for broadcast variables in a join operation
You have a large dataset A and a small dataset B. You want to join them in Spark efficiently. Which approach best uses broadcast variables?
A) Use a standard shuffle join without broadcasting
B) Broadcast the large dataset A and map over B to join using the broadcasted data
C) Broadcast the small dataset B and map over A to join using the broadcasted data
D) Collect both datasets to the driver and join locally
💡 Hint
Broadcasting the smaller dataset avoids a shuffle and speeds up the join.
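The map-side (broadcast) join pattern the hint describes can be sketched without a cluster: a plain dict stands in for the broadcast copy of the small dataset B, and each "worker" joins its slice of A against it by lookup (names and records here are illustrative, not from the problem):

```python
# Small dataset B, broadcast to every worker as a dict for O(1) lookups.
small_b = {1: 'apple', 2: 'banana'}

# Large dataset A: (key, value) records that would live in an RDD/DataFrame.
large_a = [(1, 'red'), (2, 'yellow'), (3, 'green')]

# Map over A, joining each record against the broadcast copy of B.
# Records with no matching key are dropped, like an inner join --
# and no shuffle of A is ever needed.
joined = [(k, v, small_b[k]) for k, v in large_a if k in small_b]
print(joined)  # [(1, 'red', 'apple'), (2, 'yellow', 'banana')]
```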