Broadcast joins for small tables in Apache Spark - Time & Space Complexity
We want to understand how the time to join two tables changes when one table is small and broadcasted.
How does broadcasting affect the work Spark does as data grows?
Analyze the time complexity of the following code snippet.
```python
from pyspark.sql.functions import broadcast

# large_df is a big table; small_df is a small table
joined_df = large_df.join(broadcast(small_df), on='key')
```
This code joins a large table with a small table by sending the small table to all worker nodes.
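To make the mechanics concrete, here is a minimal pure-Python sketch of what each executor effectively does in a broadcast hash join (this is an analogue for intuition, not Spark's actual implementation): build a hash map from the small table once, then stream through the large table doing a constant-time lookup per row. The function and sample data below are hypothetical illustrations.

```python
def broadcast_hash_join(large_rows, small_rows, key):
    # Build phase: one pass over the small table (done once per executor).
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)

    # Probe phase: one pass over the large table, one O(1) dict lookup per row.
    joined = []
    for row in large_rows:
        for match in lookup.get(row[key], []):
            joined.append({**row, **{c: v for c, v in match.items() if c != key}})
    return joined

large = [{'key': i % 3, 'value': i} for i in range(6)]   # keys 0, 1, 2 repeating
small = [{'key': 0, 'label': 'a'}, {'key': 1, 'label': 'b'}]
print(broadcast_hash_join(large, small, 'key'))
```

Rows of `large` with key 2 have no match in `small`, so this inner join drops them, just as `large_df.join(broadcast(small_df), on='key')` would.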
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Scanning the large table rows and matching keys with the small broadcasted table.
- How many times: Once per row in the large table; the small table is already available locally as a hash map, so each lookup takes roughly constant time.
As the large table grows, the work grows roughly in direct proportion to its size.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 lookups in the small table |
| 100 | About 100 lookups in the small table |
| 1000 | About 1000 lookups in the small table |
Pattern observation: The number of operations grows linearly with the large table size, because each row requires only a constant-time lookup in the locally available small table.
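The pattern in the table can be checked with a small counter. This is a hypothetical sketch: regardless of how many keys the small side holds, the probe phase performs exactly one lookup per large-table row.

```python
def count_probe_operations(n_large, small_keys):
    # The hash set for the small side is built once; probing then does
    # exactly one constant-time membership test per large-table row.
    lookup = set(small_keys)
    ops = 0
    for row_key in range(n_large):
        ops += 1              # one lookup per large-table row
        _ = row_key in lookup
    return ops

for n in (10, 100, 1000):
    print(n, count_probe_operations(n, small_keys=range(5)))
```

Doubling the small side's key count leaves the operation count unchanged; only `n_large` moves it.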
Time Complexity: O(n)
This means the time grows directly with the size of the large table, as each row is checked once against the small table.
[X] Wrong: "Broadcast join time depends on the size of both tables equally."
[OK] Correct: The small table is broadcast once to each worker, so its size contributes only a fixed upfront cost, not per-row work. The large table's size dominates the running time.
Understanding broadcast joins helps you explain how Spark handles joins efficiently when one table is small, a useful skill in data engineering and analysis.
"What if both tables are large and broadcasting is not used? How would the time complexity change?"
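As a starting point for that question: when neither side is small enough to broadcast, Spark typically shuffles both tables and performs a sort-merge join, which costs roughly O(n log n + m log m) for the sorts plus a linear merge pass. Below is a hypothetical single-machine analogue of that merge logic, for intuition only, not Spark's actual implementation.

```python
def sort_merge_join(left, right, key):
    # Sort both sides: O(n log n + m log m). Then one forward-only merge pass.
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, j = [], 0
    for row in left:
        # Advance past right-side keys smaller than the current left key;
        # since left is sorted, j never needs to move backward.
        while j < len(right) and right[j][key] < row[key]:
            j += 1
        # Collect every right row whose key matches (without moving j,
        # so duplicate keys on the left still find their matches).
        k = j
        while k < len(right) and right[k][key] == row[key]:
            out.append({**row, **{c: v for c, v in right[k].items() if c != key}})
            k += 1
    return out

left = [{'k': 2, 'v': 'y'}, {'k': 1, 'v': 'x'}]
right = [{'k': 1, 'w': 'q'}, {'k': 2, 'w': 'p'}]
print(sort_merge_join(left, right, 'k'))
```

The sorts dominate, so the time complexity rises from O(n) to O(n log n + m log m), and in a cluster you also pay the network cost of shuffling both tables by key.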