Outer Join in Data Analysis with Python - Time & Space Complexity
When combining two tables with an outer join, we want to understand how the time to do this grows as the tables get bigger.
We ask: How does the work increase when the number of rows in each table increases?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

def outer_join(df1, df2, key):
    result = pd.merge(df1, df2, on=key, how='outer')
    return result

# df1 and df2 are dataframes with n and m rows respectively
```
This code performs an outer join on two dataframes using a common key, combining all rows from both tables.
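To make the join concrete, here is a small illustrative example (the frames, ids, and column names are invented for demonstration; they are not part of the analysis above):

```python
import pandas as pd

# Two small example frames sharing an 'id' key
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# An outer join keeps every id from both frames; cells with no match
# on the other side are filled with NaN.
result = pd.merge(df1, df2, on="id", how="outer")
print(result)
```

Here ids 2 and 3 match, while id 1 (only in `df1`) and id 4 (only in `df2`) survive with NaN in the columns coming from the other table, giving 4 rows in total.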
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Matching rows from both tables based on the join key.
- How many times: In a naive nested-loop approach, each row in the first table is compared against every row in the second table to find matches.
As the number of rows in each table grows, the work to find matching rows grows too.
| Input Size (n rows in df1, m rows in df2) | Approx. Operations |
|---|---|
| 10, 10 | About 100 comparisons |
| 100, 100 | About 10,000 comparisons |
| 1000, 1000 | About 1,000,000 comparisons |
Pattern observation: The number of operations grows roughly by the product of the sizes of the two tables.
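The counts in the table follow directly from the nested-loop view: every row of the first table is compared with every row of the second. A minimal sketch of that counting (`count_comparisons` is an illustrative helper, not part of pandas):

```python
def count_comparisons(n, m):
    """Count pairwise comparisons in a naive nested-loop join."""
    comparisons = 0
    for _ in range(n):        # each row of df1
        for _ in range(m):    # ...compared against each row of df2
            comparisons += 1
    return comparisons

print(count_comparisons(10, 10))      # 100
print(count_comparisons(100, 100))    # 10,000
print(count_comparisons(1000, 1000))  # 1,000,000
```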
Time Complexity: O(n * m)
Under the nested-loop view above, the time for the outer join grows roughly as the product of the row counts of the two tables. (In practice, engines like pandas use hash-based joins that usually do much better; O(n * m) is the naive or worst-case bound.)
[X] Wrong: "The outer join time grows only with the size of one table."
[OK] Correct: Because the join must compare rows from both tables, the time depends on both sizes, not just one.
Understanding how joins scale helps you explain database performance and write efficient queries, a useful skill in many data roles.
"What if one table is indexed on the join key? How would the time complexity change?"
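One way to think about that question: if one table has a hash index on the join key, each lookup costs O(1) on average, so the join drops to roughly O(n + m) instead of O(n * m). A minimal sketch, using plain Python dicts to stand in for table rows (`hash_outer_join` is a hypothetical helper written for illustration, not a pandas API):

```python
def hash_outer_join(left, right, key):
    """Outer join two lists of row-dicts via a hash index on `key`."""
    index = {}
    for row in right:                         # build hash index: O(m)
        index.setdefault(row[key], []).append(row)

    matched, out = set(), []
    for lrow in left:                         # probe phase: O(n) average
        hits = index.get(lrow[key])
        if hits:
            matched.add(lrow[key])
            for rrow in hits:
                out.append({**lrow, **rrow})  # matched row from both sides
        else:
            out.append(dict(lrow))            # left-only row
    for k, rows in index.items():             # append right-only rows
        if k not in matched:
            out.extend(dict(r) for r in rows)
    return out

left = [{"id": 1, "a": 1}, {"id": 2, "a": 2}]
right = [{"id": 2, "b": 9}, {"id": 3, "b": 8}]
print(hash_outer_join(left, right, "id"))
```

Each input row is touched a constant number of times (ignoring duplicate-key blowup in the output), which is the intuition behind why indexed joins scale so much better than the nested-loop bound.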