Cross joins and when to avoid them in Apache Spark - Time & Space Complexity
Cross joins combine every row of one table with every row of another, so inputs of n and m rows produce n × m output rows — the result can explode in size very quickly.
We want to understand how the work grows as the input tables get bigger.
Analyze the time complexity of the following code snippet.
```scala
val df1 = spark.read.csv("data1.csv")   // first dataset (n rows)
val df2 = spark.read.csv("data2.csv")   // second dataset (m rows)
val crossJoined = df1.crossJoin(df2)    // pairs every row of df1 with every row of df2
crossJoined.show()
```
This code reads two datasets and creates a cross join, pairing every row from the first with every row from the second.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Pairing each row of the first dataset with every row of the second dataset.
- How many times: For each row in the first dataset, the pairing repeats once for every row in the second dataset — n × m pairings in total.
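The pairing described above can be sketched in plain Scala (no Spark needed) as two nested loops, which makes the repeated work explicit:

```scala
// Minimal sketch of the work a cross join performs: for each left row,
// visit every right row. The data here is illustrative.
val left  = Seq("a", "b", "c")   // n = 3 rows
val right = Seq(1, 2)            // m = 2 rows

val pairs = for {
  l <- left                      // outer loop: n iterations
  r <- right                     // inner loop: m iterations per outer row
} yield (l, r)

println(pairs.size)              // n * m = 6 pairs
```

The for-comprehension desugars to a nested traversal, so the total iteration count is the product of the two sizes.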
Explain the growth pattern intuitively.
| Input Size (n rows per dataset) | Approx. Operations |
|---|---|
| 10 rows in each dataset | 100 pairs (10 x 10) |
| 100 rows in each dataset | 10,000 pairs (100 x 100) |
| 1,000 rows in each dataset | 1,000,000 pairs (1,000 x 1,000) |
Pattern observation: the pair count is the product of the two dataset sizes, so when both datasets grow together the output grows quadratically.
Time Complexity: O(n * m)
This means the work is the product of the row counts of the two datasets: doubling one dataset doubles the total work, and doubling both datasets quadruples it.
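The doubling claim is simple arithmetic on n × m, which can be checked directly:

```scala
// Verifying the scaling behavior of n * m with illustrative sizes.
val n = 100L
val m = 100L
val base = n * m                        // 10,000 pairs

val doubleOneSide  = (2 * n) * m        // double only the first dataset
val doubleBothSides = (2 * n) * (2 * m) // double both datasets

println(doubleOneSide / base)           // 2x the work
println(doubleBothSides / base)         // 4x the work
```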
[X] Wrong: "Cross joins only add a small amount of extra work compared to normal joins."
[OK] Correct: Cross joins multiply the number of rows, which can quickly explode the amount of data and work, unlike normal joins that match rows based on keys.
Understanding how cross joins grow helps you explain why some queries become slow and how to avoid costly operations in real projects.
"What if we replaced the cross join with an inner join on a key? How would the time complexity change?"
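One way to build intuition for that question is a plain-Scala sketch (no Spark; the data and names are illustrative): an inner join on a key can build a hash map from one side and probe it with the other, touching each row roughly once — about O(n + m) — instead of forming all n × m pairs.

```scala
// Hash-based inner join on a key, versus the all-pairs work of a cross join.
val orders = Seq((1, "pen"), (2, "ink"), (2, "pad"))   // (customerId, item), n rows
val names  = Seq((1, "Ada"), (2, "Alan"))              // (customerId, name), m rows

val byId = names.toMap                                 // build phase: O(m)
val joined = orders.flatMap { case (id, item) =>       // probe phase: O(n)
  byId.get(id).map(name => (name, item))               // emit only matching keys
}
println(joined)                                        // List((Ada,pen), (Alan,ink), (Alan,pad))
```

Spark's own equi-join strategies (hash or sort-merge joins) follow the same principle at cluster scale, which is why joining on a key is dramatically cheaper than a cross join on large inputs.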