# Duplicates on specific columns in Pandas - Time & Space Complexity
We want to know how the time needed to find duplicates on certain columns changes as the data grows.
How does the work increase when we have more rows in the table?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x'],
    'C': [10, 20, 20, 30, 40, 40, 40]
})

duplicates = df.duplicated(subset=['A', 'B'])
```
This code checks which rows have duplicate values in columns 'A' and 'B'.
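To make the behavior concrete, here is the snippet again with its output inspected. `duplicated` returns a boolean Series where every occurrence of an ('A', 'B') pair after the first is marked `True`:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x'],
    'C': [10, 20, 20, 30, 40, 40, 40]
})

# Rows 2, 5, and 6 repeat an ('A', 'B') pair seen earlier, so they are True.
duplicates = df.duplicated(subset=['A', 'B'])
print(duplicates.tolist())  # [False, False, True, False, False, True, True]
```

Note that column 'C' plays no role here: only the columns named in `subset` are hashed.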
Identify the loops, recursion, or repeated traversals in the operation:
- Primary operation: For each row, hashing values in columns 'A' and 'B' to check if seen before.
- How many times: Once per row.
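That per-row hash-and-check step can be sketched in plain Python. (pandas uses a vectorized hash table internally; the loop below is only an illustration of the idea.)

```python
def mark_duplicates(rows):
    """Return one True/False flag per row: True if the key was seen before."""
    seen = set()
    flags = []
    for key in rows:
        flags.append(key in seen)  # O(1) average-case set lookup
        seen.add(key)
    return flags

# Keys are the ('A', 'B') pairs from the example DataFrame.
pairs = list(zip([1, 2, 2, 3, 4, 4, 4], ['x', 'y', 'y', 'z', 'x', 'x', 'x']))
print(mark_duplicates(pairs))  # [False, False, True, False, False, True, True]
```

One pass over the rows, one constant-time lookup per row: that is where the linear growth comes from.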
As the number of rows grows, the work to check duplicates grows roughly in proportion to the number of rows.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks |
| 100 | About 100 checks |
| 1000 | About 1000 checks |
Pattern observation: The number of operations grows roughly in a straight line as rows increase.
Time Complexity: O(n) on average, since each row's key is hashed and looked up in O(1) expected time.
Space Complexity: O(n), since the table of seen keys can hold up to one entry per row.
This means the time to find duplicates grows directly with the number of rows, and so does the memory used to track them.
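The linear pattern can be checked empirically. A rough timing sketch follows; the absolute times depend on the machine, so only the ratios between sizes matter:

```python
import time

import numpy as np
import pandas as pd

# Time duplicated() at increasing row counts; with O(n) scaling, tenfold
# more rows should take roughly tenfold longer.
timings = {}
for n in (10_000, 100_000, 1_000_000):
    df = pd.DataFrame({
        'A': np.random.randint(0, 1000, size=n),
        'B': np.random.choice(list('xy'), size=n),
    })
    start = time.perf_counter()
    df.duplicated(subset=['A', 'B'])
    timings[n] = time.perf_counter() - start
    print(f"{n:>9} rows: {timings[n]:.4f}s")
```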
[X] Wrong: "Finding duplicates on specific columns takes much longer than checking all columns."
[OK] Correct: The cost is dominated by the number of rows, not the number of columns checked. Hashing more columns adds a small constant cost per row, so the growth rate stays linear either way.
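A quick comparison illustrates this point (the column names and sizes below are illustrative, not from the original example). Both calls hash one key per row; checking all columns just builds a wider key. As a sanity check, any row that is a duplicate on all columns must also be a duplicate on the subset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({
    'A': rng.integers(0, 1000, size=n),
    'B': rng.choice(list('xyz'), size=n),
    'C': rng.random(n),
})

# Both calls are O(n): one hashed key per row. The all-columns version
# hashes a wider key, which only changes the constant factor.
subset_dups = df.duplicated(subset=['A', 'B'])
all_dups = df.duplicated()
print(subset_dups.sum(), all_dups.sum())
```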
Understanding how duplicate checks scale helps you explain data cleaning steps clearly and shows you know how data size affects performance.
"What if we checked duplicates on all columns instead of specific ones? How would the time complexity change?"