join functions (left_join, inner_join) in R Programming - Time & Space Complexity
When we use join functions like left_join or inner_join, we combine two tables based on matching values.
We want to know how the time needed grows as the tables get bigger.
Analyze the time complexity of the following R code using dplyr joins.
```r
library(dplyr)

# Two example data frames
df1 <- data.frame(id = 1:1000, value1 = rnorm(1000))
df2 <- data.frame(id = sample(1:1500, 800), value2 = rnorm(800))

# Perform a left join on 'id', keeping every row of df1
result <- left_join(df1, df2, by = "id")
```
This code joins two tables by matching the id column, keeping all rows from the first table.
To analyze the complexity, look for the repeated steps whose cost grows with the input size.
- Primary operation: matching each row in the first table to rows in the second table by key.
- How many times: for each of the n rows in the first table, the join looks up matching rows in the second table.
As the first table grows, the join needs to find matches for more rows.
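This per-row lookup can be sketched with base R's `match()`, which hashes the lookup keys once and then probes that table for each row. This is a simplified illustration of the idea, not dplyr's actual implementation:

```r
# Simplified sketch of a left join using base R's match():
# the keys of df2 are hashed once, then probed once per row of df1.
df1 <- data.frame(id = 1:5, value1 = c(10, 20, 30, 40, 50))
df2 <- data.frame(id = c(2, 4, 9), value2 = c("a", "b", "c"))

# One hashed lookup per row of df1; ids with no match return NA
idx <- match(df1$id, df2$id)
result <- cbind(df1, value2 = df2$value2[idx])
print(result)  # rows with id 2 and 4 get "a" and "b"; the rest get NA
```

Because each lookup is (amortized) constant time, the total work scales with the number of rows being looked up rather than with the number of row pairs.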
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 lookups in second table |
| 100 | About 100 lookups in second table |
| 1000 | About 1000 lookups in second table |
Pattern observation: The number of operations grows roughly in direct proportion to the size of the first table.
Time Complexity: O(n)
The join time grows linearly with the number of rows n in the first table. More precisely it is O(n + m), since the join also builds a lookup structure over the m rows of the second table, but with the second table held fixed the growth is linear in n.
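You can check the linear growth empirically with a rough timing loop like the sketch below (exact timings depend on your machine; the point is the roughly 10x time increase per 10x size increase):

```r
library(dplyr)

# Fixed second table, as in the example above
set.seed(1)
df2 <- data.frame(id = sample(1:1500, 800), value2 = rnorm(800))

# Time the join as the first table grows 10x each step
for (n in c(1e4, 1e5, 1e6)) {
  df1 <- data.frame(id = sample(1:1500, n, replace = TRUE),
                    value1 = rnorm(n))
  elapsed <- system.time(res <- left_join(df1, df2, by = "id"))["elapsed"]
  cat(sprintf("n = %7d  elapsed = %.3f s\n", n, elapsed))
}
```

Since the ids in `df2` are unique, every row of `df1` matches at most one row, so the output always has exactly n rows.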
[X] Wrong: "Joining two tables always takes time proportional to the product of their sizes (like n x m)."
[OK] Correct: Efficient join functions use fast lookups (such as hash tables) to avoid comparing every pair of rows, so the time is roughly linear in the sizes of the two tables (O(n + m)), not their product.
Understanding how join operations scale helps you explain data processing choices clearly and confidently in real projects or interviews.
"What if the second table is much larger than the first? How would that affect the time complexity of the join?"