Which of the following best explains why using a broadcast join can improve Spark job performance?
Think about how data movement affects network traffic in distributed systems.
A broadcast join improves performance by sending the smaller dataset to every executor, so each node can perform the join locally and the expensive shuffle of the large dataset is avoided.
What will be the output count of the joined DataFrame when using a broadcast join on a small dataset?
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName('JoinTest').getOrCreate()
large_df = spark.createDataFrame([(i, f'value_{i}') for i in range(1000)], ['id', 'val'])
small_df = spark.createDataFrame([(i, f'small_{i}') for i in range(10)], ['id', 'small_val'])
joined_df = large_df.join(broadcast(small_df), 'id')
print(joined_df.count())
Consider how many matching keys exist between the two datasets.
The join keeps rows whose 'id' appears in both datasets. small_df contains ids 0 through 9, all of which also exist in large_df, so exactly 10 rows join and the printed count is 10.
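The count can be sanity-checked outside Spark with a plain-Python hash join over the same two id ranges (a minimal sketch mirroring the DataFrames above, not Spark itself):

```python
# Mirror the two DataFrames from the snippet above as plain lists.
large_rows = [(i, f'value_{i}') for i in range(1000)]
small_rows = [(i, f'small_{i}') for i in range(10)]

# Hash join on 'id': build a lookup from the small side, probe with the large side.
small_by_id = {i: v for i, v in small_rows}
joined = [(i, val, small_by_id[i]) for i, val in large_rows if i in small_by_id]

print(len(joined))  # 10 -- only ids 0..9 exist on both sides
```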
Given two large DataFrames joined using shuffle hash join, what is the expected effect on network data transfer?
Think about what shuffle means in Spark joins.
A shuffle hash join repartitions both datasets across the cluster by join key, so roughly the full size of both inputs crosses the network, causing high data transfer.
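A back-of-envelope comparison of the two strategies' network cost can be sketched like this (the sizes and executor count are hypothetical, not measured Spark numbers):

```python
# Hypothetical dataset sizes in MB (illustrative only).
large_mb = 10_000
small_mb = 8

# Shuffle hash join: both sides are repartitioned by key, so roughly
# the full size of both datasets crosses the network.
shuffle_transfer = large_mb + small_mb

# Broadcast join: only the small side is copied to each executor.
num_executors = 50
broadcast_transfer = small_mb * num_executors

print(shuffle_transfer, broadcast_transfer)  # 10008 vs 400
```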
Which of the following is the most likely cause of slow join performance in this Spark job?
df1.join(df2, 'key')
Consider what happens when Spark joins large datasets without optimization.
When both DataFrames are large and no hint or threshold applies, Spark falls back to a shuffle-based join, which redistributes both inputs across the network by key, and that data movement is what slows the job down.
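The shuffle step behind that data movement can be sketched in plain Python (a toy hash-partitioner, not Spark's actual implementation):

```python
from collections import defaultdict

def shuffle_by_key(rows, num_partitions=4):
    # Route each (key, value) row to a partition by hashing its key --
    # this is the movement a shuffle join performs on BOTH inputs.
    partitions = defaultdict(list)
    for key, value in rows:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

df1_rows = [(k, f'a{k}') for k in range(8)]
df2_rows = [(k, f'b{k}') for k in range(8)]

p1 = shuffle_by_key(df1_rows)
p2 = shuffle_by_key(df2_rows)

# After the shuffle, equal keys from both inputs sit in the same
# partition index, so each partition can be joined locally.
for i in range(4):
    print(i, sorted(k for k, _ in p1[i]), sorted(k for k, _ in p2[i]))
```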
You have two datasets: one with 1 million rows and another with 100 rows. Which join strategy should you choose to optimize performance in Spark?
Think about minimizing data movement and network cost.
Broadcast join is best when one dataset is small; it avoids shuffling the large dataset and speeds up the join.
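Spark automates this choice via spark.sql.autoBroadcastJoinThreshold (10 MB by default). The decision rule can be sketched in plain Python (a hypothetical simplification, not Spark's actual planner):

```python
def choose_join_strategy(left_bytes, right_bytes,
                         broadcast_threshold=10 * 1024 * 1024):
    # Pick a strategy by size, in the spirit of Spark's
    # autoBroadcastJoinThreshold (simplified sketch).
    if min(left_bytes, right_bytes) <= broadcast_threshold:
        return 'broadcast'  # ship the small side to every executor
    return 'shuffle'        # repartition both sides by join key

# A ~50 MB dataset joined with a ~5 KB dataset:
print(choose_join_strategy(50 * 1024 * 1024, 5 * 1024))  # broadcast
```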