Challenge - 5 Problems
Broadcast Join Master
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Predict Output
Difficulty: intermediate · Time limit: 2:00
Output of broadcast join with small table
What is the output count of the following Spark code snippet that uses a broadcast join?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Large DataFrame
df_large = spark.createDataFrame([(1, 'A'), (2, 'B'), (3, 'C')], ['id', 'value'])

# Small DataFrame to broadcast
small_df = spark.createDataFrame([(1, 'X'), (3, 'Y')], ['id', 'desc'])

# Broadcast join
joined_df = df_large.join(broadcast(small_df), 'id', 'inner')
print(joined_df.count())
💡 Hint
Think about how many matching keys exist in both DataFrames.
Explanation: The join is an inner join on 'id'. The small DataFrame has ids 1 and 3, which match two rows in the large DataFrame, so the result has 2 rows.
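The count can be reasoned about without a cluster. As a minimal pure-Python sketch of the same inner-join logic (a simulation of the idea, not Spark code): only rows whose key also appears in the small table survive.

```python
# Pure-Python sketch of the inner join above (not Spark itself):
# only large-table rows whose key appears in the small table are kept.
large = [(1, 'A'), (2, 'B'), (3, 'C')]
small = [(1, 'X'), (3, 'Y')]

small_keys = {k for k, _ in small}          # the keys that get "broadcast"
joined = [(k, v) for k, v in large if k in small_keys]

print(len(joined))  # 2 -- ids 1 and 3 match
```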
❓ Data Output
Difficulty: intermediate · Time limit: 2:00
Resulting DataFrame after broadcast join
What is the content of the DataFrame after this broadcast join?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(10, 'apple'), (20, 'banana'), (30, 'cherry')], ['key', 'fruit'])
small_df = spark.createDataFrame([(10, 'red'), (30, 'dark red')], ['key', 'color'])

result = df1.join(broadcast(small_df), 'key', 'left')
result.show()
💡 Hint
Remember that a left join keeps all rows from the left DataFrame and adds matching rows from the right.
Explanation: The left join keeps all keys from df1. Keys 10 and 30 pick up matching colors; key 20 has no match, so its color is null.
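The resulting rows can be sketched in plain Python, with None standing in for Spark's null (a simulation of the left-join semantics, not Spark code):

```python
# Pure-Python sketch of the left join above: every left row is kept,
# and color is None when the key has no match in the small table.
df1 = [(10, 'apple'), (20, 'banana'), (30, 'cherry')]
colors = {10: 'red', 30: 'dark red'}        # small side as a lookup map

result = [(k, fruit, colors.get(k)) for k, fruit in df1]
print(result)
# [(10, 'apple', 'red'), (20, 'banana', None), (30, 'cherry', 'dark red')]
```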
🔧 Debug
Difficulty: advanced · Time limit: 2:00
Identify the error in broadcast join code
What error will this Spark code raise when executed?
Apache Spark
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

large_df = spark.createDataFrame([(1, 'A'), (2, 'B')], ['id', 'val'])
small_df = spark.createDataFrame([(1, 'X'), (2, 'Y')], ['key', 'desc'])

joined = large_df.join(broadcast(small_df), large_df.id == small_df.key)
joined.show()
💡 Hint
Check if the join columns have the same name or if they are ambiguous.
Explanation: Despite the question's framing, this code raises no error. Because the join keys have different names (id and key), the result simply contains both columns side by side; there are no duplicate or ambiguous column names. Ambiguity problems arise instead when both sides share a column name and the join condition is an expression rather than a column-name list, since both same-named columns are then retained.
🚀 Application
Difficulty: advanced · Time limit: 2:00
Choosing broadcast join for performance
You have a large DataFrame with 10 million rows and a small DataFrame with 100 rows. Which Spark join strategy is best for performance?
💡 Hint
Think about minimizing data movement for small tables.
Explanation: A broadcast join is best: it ships the 100-row DataFrame to every executor, so the 10-million-row DataFrame can be joined in place without an expensive shuffle.
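The mechanics behind this can be sketched in plain Python: a broadcast hash join turns the small table into a hash map, copies it to every worker, and probes it locally, so the large table never moves. This is a simulation of the idea, not Spark code:

```python
# Simulated broadcast hash join: the small table becomes a hash map that
# is (conceptually) copied to every executor; each partition of the large
# table is then joined locally, with no shuffle of the large side.
small = [(1, 'X'), (3, 'Y')]
broadcast_map = dict(small)                 # the "broadcast" copy

large_partitions = [[(1, 'A'), (2, 'B')], [(3, 'C')]]  # large side stays put

joined = []
for partition in large_partitions:          # runs independently per executor
    for key, value in partition:
        if key in broadcast_map:
            joined.append((key, value, broadcast_map[key]))

print(joined)  # [(1, 'A', 'X'), (3, 'C', 'Y')]
```

The large side is only ever read where it already lives, which is exactly the data movement a shuffle join cannot avoid.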
🧠 Conceptual
Difficulty: expert · Time limit: 2:00
Effect of broadcast join on shuffle partitions
When using a broadcast join in Spark, what happens to the shuffle partitions compared to a regular shuffle join?
💡 Hint
Consider how broadcast join avoids shuffling the small table.
Explanation: With a broadcast join, the small table is copied to every executor, so neither side needs to be shuffled for the join. Compared with a shuffle (sort-merge) join, this eliminates the shuffle stage entirely, along with its partition-exchange overhead.
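Spark also chooses this strategy automatically for tables smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default). As a hedged configuration sketch, assuming an active SparkSession named `spark`:

```python
# Configuration sketch (assumes an active SparkSession named `spark`):
# raise the size limit under which Spark auto-broadcasts a join side.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)  # 50 MB

# Setting it to -1 disables automatic broadcasting entirely; wrapping a
# DataFrame in broadcast(df) inside a join is then the explicit way to
# request the broadcast strategy.
```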