Why combining datasets creates complete pictures in Data Analysis Python - Performance Analysis
When we combine datasets, we want to see the full story from different pieces of data.
We ask: How does the time to combine data grow as the datasets get bigger?
Analyze the time complexity of the following code snippet.
import pandas as pd
n = 10  # Number of rows in each dataset
# Two datasets with n rows each
left = pd.DataFrame({'key': range(n), 'value_left': range(n)})
right = pd.DataFrame({'key': range(n), 'value_right': range(n)})
# Naively combine the datasets on 'key' using nested loops
combined_list = []
for row_l in left.itertuples():
    for row_r in right.itertuples():
        # Compare every left row with every right row
        if row_l.key == row_r.key:
            combined_list.append({
                'key': row_l.key,
                'value_left': row_l.value_left,
                'value_right': row_r.value_right
            })
combined = pd.DataFrame(combined_list)
This code merges two datasets by matching rows with the same key to create a complete picture.
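For comparison, the same combined table can be produced with pandas' built-in `pd.merge`. This is a minimal sketch, assuming the same `left` and `right` frames as above and an inner join on `key`; it says nothing about how pandas performs the matching internally.

```python
import pandas as pd

n = 10  # same illustrative size as the snippet above

left = pd.DataFrame({'key': range(n), 'value_left': range(n)})
right = pd.DataFrame({'key': range(n), 'value_right': range(n)})

# Inner join on 'key': keeps only rows whose key appears in both frames
merged = pd.merge(left, right, on='key', how='inner')

print(merged.shape)          # (10, 3)
print(list(merged.columns))  # ['key', 'value_left', 'value_right']
```

The result has one row per shared key and carries the value columns from both datasets, which is the "complete picture" the nested loops build by hand.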
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Matching keys between two datasets to join rows.
- How many times: Each of the n rows in the first dataset is compared with all n rows in the second, for n × n comparisons in total.
As the number of rows (n) grows, the work to find matching keys grows too.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 100 comparisons |
| 100 | About 10,000 comparisons |
| 1000 | About 1,000,000 comparisons |
Pattern observation: The number of operations grows much faster than the input size, roughly like the square of n.
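The figures in the table can be checked by counting comparisons directly. The sketch below is an illustration only: it replaces the DataFrames with plain `range` objects of keys, and `count_comparisons` is a hypothetical helper name, not part of the original snippet.

```python
def count_comparisons(n):
    """Count key comparisons made by a nested-loop join of two n-row datasets."""
    left_keys = range(n)
    right_keys = range(n)
    comparisons = 0
    matches = 0
    for key_l in left_keys:
        for key_r in right_keys:
            comparisons += 1  # one equality check per pair of rows
            if key_l == key_r:
                matches += 1
    return comparisons, matches


for n in (10, 100, 1000):
    comparisons, matches = count_comparisons(n)
    print(f"n={n}: {comparisons} comparisons, {matches} matches")
# n=10: 100 comparisons, 10 matches
# n=100: 10000 comparisons, 100 matches
# n=1000: 1000000 comparisons, 1000 matches
```

Only n of the n × n comparisons find a match; the rest is work spent ruling pairs out.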
Time Complexity: O(n²)
This means if you double the size of your datasets, the time to combine them roughly quadruples.
[X] Wrong: "Combining two datasets always takes time proportional to their size added together."
[OK] Correct: Not always. The nested-loop approach above compares every row in one dataset with every row in the other, so its time grows with the product of the sizes, not their sum.
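The word "always" is what makes the claim wrong: some strategies do stay close to the sum of the sizes. As a contrast to the nested loops, here is a minimal sketch of a dictionary-based join. It assumes keys are unique within each dataset and uses plain lists of `(key, value)` tuples rather than DataFrames; `hash_join` is a hypothetical name chosen for this example.

```python
def hash_join(left_rows, right_rows):
    """Join two lists of (key, value) pairs using dict lookups instead of a nested loop."""
    # Build phase: one pass over the right dataset (assumes unique keys)
    right_by_key = {key: value for key, value in right_rows}

    combined = []
    # Probe phase: one pass over the left dataset, one dict lookup per row
    for key, value_left in left_rows:
        if key in right_by_key:
            combined.append((key, value_left, right_by_key[key]))
    return combined


n = 5
left_rows = [(k, k * 10) for k in range(n)]
right_rows = [(k, k * 100) for k in range(n)]
print(hash_join(left_rows, right_rows))
# [(0, 0, 0), (1, 10, 100), (2, 20, 200), (3, 30, 300), (4, 40, 400)]
```

Each dataset is scanned once, and dict lookups take roughly constant time on average, so the work here grows about linearly with the total number of rows.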
Understanding how combining data grows with size helps you explain your approach clearly and shows you know what happens behind the scenes.
"What if the datasets were already sorted by the key? How would the time complexity change?"
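One way to explore this question is to count steps in a two-pointer merge over sorted inputs. The sketch below assumes both datasets are already sorted by key and that keys are unique within each; `sorted_merge_join` is a hypothetical helper, and the rows are plain `(key, value)` tuples rather than DataFrame rows.

```python
def sorted_merge_join(left_rows, right_rows):
    """Join two lists of (key, value) pairs that are sorted by key with unique keys."""
    i = j = 0
    steps = 0
    combined = []
    while i < len(left_rows) and j < len(right_rows):
        steps += 1  # one key comparison step per loop iteration
        key_l, value_l = left_rows[i]
        key_r, value_r = right_rows[j]
        if key_l == key_r:
            combined.append((key_l, value_l, value_r))
            i += 1
            j += 1
        elif key_l < key_r:
            i += 1  # left key is too small to match anything further right
        else:
            j += 1  # right key is too small to match anything further left
    return combined, steps


n = 1000
rows = [(k, k) for k in range(n)]
combined, steps = sorted_merge_join(rows, rows)
print(len(combined), steps)  # 1000 1000
```

Every iteration advances at least one pointer, so the number of steps never exceeds the two sizes added together. Compare that with the 1,000,000 comparisons the nested loops need at n = 1000.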