
Broadcast variables in Apache Spark - Practice Problems & Coding Challenges

Challenge - 5 Problems
Problem 1: Predict Output (intermediate)
Output of broadcast variable usage in Spark
What is the output of this Spark code snippet using a broadcast variable?
Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('BroadcastTest').getOrCreate()
sc = spark.sparkContext

broadcastVar = sc.broadcast([1, 2, 3])
rdd = sc.parallelize([0, 1, 2, 3])
result = rdd.map(lambda x: broadcastVar.value[x] if x < len(broadcastVar.value) else -1).collect()
print(result)
A) [0, 1, 2, 3]
B) [1, 2, 3, -1]
C) [1, 2, 3, 3]
D) IndexError
💡 Hint
Remember that broadcast variables share data efficiently and you can access their value attribute.
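The bounds-checked lookup pattern in the snippet can be tried in plain Python, with placeholder data here so the challenge answer isn't given away:

```python
# Same pattern as the map() lambda above, with stand-in data.
lookup = ['x', 'y']   # stands in for broadcastVar.value
indices = [0, 1, 2]   # stands in for the RDD elements

# Index into the list when in range, otherwise fall back to -1.
result = [lookup[i] if i < len(lookup) else -1 for i in indices]
print(result)  # ['x', 'y', -1]
```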
Problem 2: 🧠 Conceptual (intermediate)
Purpose of broadcast variables in Spark
What is the main purpose of using broadcast variables in Apache Spark?
A) To efficiently share a read-only variable across all worker nodes without sending it with every task
B) To cache RDDs in memory for faster access
C) To shuffle data between partitions during a join operation
D) To store intermediate results on disk to avoid recomputation
💡 Hint
Think about how to avoid sending the same data multiple times to workers.
Problem 3: 🔧 Debug (advanced)
Identify the error with broadcast variable usage
What error will this Spark code raise?
Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('BroadcastError').getOrCreate()
sc = spark.sparkContext

broadcastVar = sc.broadcast({'a': 1, 'b': 2})
rdd = sc.parallelize(['a', 'b', 'c'])
result = rdd.map(lambda x: broadcastVar.value[x]).collect()
print(result)
A) KeyError
B) TypeError
C) ValueError
D) No error, output: [1, 2, None]
💡 Hint
Check what happens if a key is missing in a dictionary lookup.
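The hint can be verified without Spark at all: the lambda inside the map() is ordinary Python, so a dictionary lookup behaves exactly as it would locally.

```python
d = {'a': 1, 'b': 2}

# Indexing with a missing key raises KeyError...
try:
    d['c']
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# ...while .get() returns a default instead of raising.
print(d.get('c', -1))  # -1
```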
Problem 4: Predict Output (advanced)
Result of modifying broadcast variable after creation
What is the output of this Spark code?
Python (PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.master('local').appName('BroadcastModify').getOrCreate()
sc = spark.sparkContext

broadcastVar = sc.broadcast([10, 20, 30])
broadcastVar.value.append(40)
rdd = sc.parallelize([0, 1, 2, 3])
result = rdd.map(lambda x: broadcastVar.value[x] if x < len(broadcastVar.value) else -1).collect()
print(result)
A) [10, 20, 30, 40]
B) [10, 20, 30]
C) AttributeError
D) [10, 20, 30, -1]
💡 Hint
Modifications to broadcast variables on the driver after creation do not propagate to workers.
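The reason mutations don't propagate is that the broadcast value is serialized when sc.broadcast() is called; workers later deserialize that snapshot. Python's pickle module gives a rough stand-in for the mechanism (a sketch, not PySpark's actual internals):

```python
import pickle

data = [10, 20, 30]
snapshot = pickle.dumps(data)   # roughly what sc.broadcast() captures up front
data.append(40)                 # later driver-side mutation

worker_copy = pickle.loads(snapshot)  # what a worker would deserialize
print(worker_copy)  # [10, 20, 30] -- the append never reached the snapshot
```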
Problem 5: 🚀 Application (expert)
Best use case for broadcast variables in a join operation
You have a large dataset A and a small dataset B. You want to join them in Spark efficiently. Which approach best uses broadcast variables?
A) Use a standard shuffle join without broadcasting
B) Broadcast the large dataset A and map over B to join using the broadcasted data
C) Broadcast the small dataset B and map over A to join using the broadcasted data
D) Collect both datasets to the driver and join locally
💡 Hint
Broadcasting the smaller dataset avoids a shuffle and speeds up the join.
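The map-side (broadcast) join pattern the hint describes can be sketched without a cluster: a plain dict stands in for the broadcast copy of the small dataset B, and each "worker" joins its slice of A against it by lookup (names and records here are illustrative, not from the problem):

```python
# Small dataset B, broadcast to every worker as a dict for O(1) lookups.
small_b = {1: 'apple', 2: 'banana'}

# Large dataset A: (key, value) records that would live in an RDD/DataFrame.
large_a = [(1, 'red'), (2, 'yellow'), (3, 'green')]

# Map over A, joining each record against the broadcast copy of B.
# Records with no matching key are dropped, like an inner join --
# and no shuffle of A is ever needed.
joined = [(k, v, small_b[k]) for k, v in large_a if k in small_b]
print(joined)  # [(1, 'red', 'apple'), (2, 'yellow', 'banana')]
```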