0
0
Apache Sparkdata~5 mins

Cross joins and when to avoid them in Apache Spark - Time & Space Complexity

Choose your learning style9 modes available
Time Complexity: Cross joins and when to avoid them
O(n * m)
Understanding Time Complexity

Cross joins combine every row from one table with every row from another. This can create a huge number of results quickly.

We want to understand how the work grows as the input tables get bigger.

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

val df1 = spark.read.csv("data1.csv")
val df2 = spark.read.csv("data2.csv")

val crossJoined = df1.crossJoin(df2)
crossJoined.show()

This code reads two datasets and creates a cross join, pairing every row from the first with every row from the second.

Identify Repeating Operations

Identify the loops, recursion, array traversals that repeat.

  • Primary operation: Pairing each row of the first dataset with every row of the second dataset.
  • How many times: For each row in the first dataset, it repeats for all rows in the second dataset.
How Execution Grows With Input

Explain the growth pattern intuitively.

Input Size (n)Approx. Operations
10 rows in each dataset100 pairs (10 x 10)
100 rows in each dataset10,000 pairs (100 x 100)
1,000 rows in each dataset1,000,000 pairs (1,000 x 1,000)

Pattern observation: The number of pairs grows very fast, multiplying the sizes of both datasets.

Final Time Complexity

Time Complexity: O(n * m)

This means the work grows by multiplying the number of rows in both datasets, so doubling one dataset doubles the total work.

Common Mistake

[X] Wrong: "Cross joins only add a small amount of extra work compared to normal joins."

[OK] Correct: Cross joins multiply the number of rows, which can quickly explode the amount of data and work, unlike normal joins that match rows based on keys.

Interview Connect

Understanding how cross joins grow helps you explain why some queries become slow and how to avoid costly operations in real projects.

Self-Check

"What if we replaced the cross join with an inner join on a key? How would the time complexity change?"