Inner join behavior in Pandas - Time & Space Complexity
When we combine two tables using an inner join, we want to know how the time it takes grows as the tables get bigger.
We ask: How does the work increase when the number of rows in each table increases?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# Two dataframes with columns 'key' and 'value'
n = 1000
df1 = pd.DataFrame({'key': range(n), 'value': range(n)})
df2 = pd.DataFrame({'key': range(n), 'value': range(n, 2*n)})

# Perform inner join on 'key'
result = pd.merge(df1, df2, on='key', how='inner')
```
This code joins two tables on a common column, keeping only rows with matching keys.
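As a quick sanity check on what the merge produces (using the same n = 1000 setup as above), the inner join should keep one row per matching key, and pandas automatically suffixes the clashing 'value' columns:

```python
import pandas as pd

n = 1000
df1 = pd.DataFrame({'key': range(n), 'value': range(n)})
df2 = pd.DataFrame({'key': range(n), 'value': range(n, 2*n)})

result = pd.merge(df1, df2, on='key', how='inner')

# Every key appears in both tables, so all n rows survive the inner join
print(len(result))            # 1000
# Overlapping non-key columns get the default _x/_y suffixes
print(list(result.columns))   # ['key', 'value_x', 'value_y']
```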
Identify the operations that repeat: loops, recursion, or array traversals.
- Primary operation: Matching keys between the two tables.
- How many times: Each row in the first table is checked against keys in the second table.
As the number of rows in each table grows, the work to find matching keys grows roughly in proportion to the size of the tables.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks per table |
| 100 | About 100 checks per table |
| 1000 | About 1000 checks per table |
Pattern observation: The work grows roughly in a straight line as the tables get bigger.
Time Complexity: O(n) — or more generally O(n + m) for tables with n and m rows.
This means the time to join grows directly with the number of rows in the tables. Space complexity is also linear: the result holds one row per matching key, plus the hash table built for matching.
[X] Wrong: "Joining two tables always takes time proportional to the product of their sizes."
[OK] Correct: Pandas uses efficient methods like hashing to avoid checking every pair, so it usually runs in time proportional to the total number of rows, not their product.
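To see why hashing makes the join linear, here is a minimal sketch of a hash join in plain Python. This is an illustration of the idea, not pandas' actual implementation (which is optimized C/Cython code):

```python
def hash_inner_join(left, right):
    """Inner-join two lists of (key, value) pairs on key.

    Build a hash table over the right side once (O(m)),
    then probe it once per left row (O(n)): O(n + m) total,
    not O(n * m).
    """
    # Build phase: index the right side by key
    index = {}
    for key, rval in right:
        index.setdefault(key, []).append(rval)

    # Probe phase: one dict lookup per left row
    result = []
    for key, lval in left:
        for rval in index.get(key, []):
            result.append((key, lval, rval))
    return result

left = [(1, 'a'), (2, 'b'), (3, 'c')]
right = [(2, 'x'), (3, 'y'), (4, 'z')]
print(hash_inner_join(left, right))  # [(2, 'b', 'x'), (3, 'c', 'y')]
```

Each dict lookup is expected O(1), which is why the total work stays proportional to the number of rows rather than their product.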
Understanding how joins scale helps you explain data processing choices clearly and shows you know how to handle bigger datasets efficiently.
"What if we changed the join type from inner to outer? How would the time complexity change?"
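As a starting point for that question, a small experiment with partially overlapping keys (the key values here are made up for illustration) shows what `how='outer'` keeps that an inner join drops:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'left': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': [2, 3, 4], 'right': ['x', 'y', 'z']})

inner = pd.merge(df1, df2, on='key', how='inner')
outer = pd.merge(df1, df2, on='key', how='outer')

# Inner keeps only the matching keys (2 and 3); outer keeps
# every key from both sides, filling the missing side with NaN.
print(len(inner))  # 2
print(len(outer))  # 4
```

The matching itself is still hash-based, so the asymptotic time stays linear in the total number of rows; what changes is the size of the output, since unmatched rows are kept instead of discarded.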