0
0
Apache Sparkdata~5 mins

Broadcast joins for small tables in Apache Spark - Time & Space Complexity

Choose your learning style9 modes available
Time Complexity: Broadcast joins for small tables
O(n)
Understanding Time Complexity

We want to understand how the time to join two tables changes when one table is small and broadcasted.

How does broadcasting affect the work Spark does as data grows?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

from pyspark.sql.functions import broadcast

# large_df is a big table
# small_df is a small table
joined_df = large_df.join(broadcast(small_df), on='key')

This code joins a large table with a small table by sending the small table to all worker nodes.

Identify Repeating Operations

Identify the loops, recursion, array traversals that repeat.

  • Primary operation: Scanning the large table rows and matching keys with the small broadcasted table.
  • How many times: Once per row in the large table, since the small table is already available locally.
How Execution Grows With Input

As the large table grows, the work grows roughly in direct proportion to its size.

Input Size (n)Approx. Operations
10About 10 lookups in the small table
100About 100 lookups in the small table
1000About 1000 lookups in the small table

Pattern observation: The number of operations grows linearly with the large table size, since the small table is quickly accessible everywhere.

Final Time Complexity

Time Complexity: O(n)

This means the time grows directly with the size of the large table, as each row is checked once against the small table.

Common Mistake

[X] Wrong: "Broadcast join time depends on the size of both tables equally."

[OK] Correct: The small table is sent once to all workers, so its size does not affect the repeated work. The large table size dominates the time.

Interview Connect

Understanding broadcast joins helps you explain how Spark handles joins efficiently when one table is small, a useful skill in data engineering and analysis.

Self-Check

"What if both tables are large and broadcasting is not used? How would the time complexity change?"