Merging on different column names in Pandas - Time & Space Complexity
We want to understand how the time needed to merge two tables changes as the tables get bigger.
Specifically, we ask: How does merging on columns with different names affect the work done?
Analyze the time complexity of the following code snippet.
import pandas as pd
# Create two dataframes with different column names for keys
df1 = pd.DataFrame({
'key1': [1, 2, 3],
'value1': ['A', 'B', 'C']
})
df2 = pd.DataFrame({
'key2': [1, 2, 4],
'value2': ['D', 'E', 'F']
})
# Merge on different column names using left_on and right_on
merged = pd.merge(df1, df2, left_on='key1', right_on='key2')
This code merges two tables by matching values from different column names.
Identify the loops, recursion, array traversals that repeat.
- Primary operation: Comparing each key in the first table to keys in the second table to find matches.
- How many times: For each row in the first table, the merge process looks up matching rows in the second table.
As the number of rows in both tables grows, the work to find matching keys grows too.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 lookups |
| 100 | About 100 lookups |
| 1000 | About 1000 lookups |
Pattern observation: The number of operations grows roughly in direct proportion to the size of the input tables.
Time Complexity: O(n)
This means the time to merge grows linearly with the number of rows in the first table (assuming the second table's keys are indexed).
[X] Wrong: "Merging on different column names makes the merge slower than merging on same-named columns."
[OK] Correct: The column names do not affect how many comparisons happen; the merge uses the values, so the time depends on the number of rows, not the column names.
Understanding how merging scales helps you explain data joining tasks clearly and shows you know what happens behind the scenes.
"What if we merged two tables where one key column is not unique? How would the time complexity change?"