`duplicated()` for finding duplicates in Pandas - Time & Space Complexity
We want to understand how the time needed to find duplicates grows as the data gets bigger.
How does pandas check each row to see if it appeared before?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x']
})

# Boolean Series: True marks rows identical to an earlier row
duplicates = data.duplicated()
print(duplicates)
```
This code creates a small table and finds which rows are duplicates of earlier rows.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Hashing each row and checking whether that hash was seen before in a hash table.
- How many times: Once per row, and each membership check takes O(1) time on average.
As the number of rows grows, the work to find duplicates grows in proportion to the number of rows.
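The idea can be sketched in plain Python. This is a simplified illustration of hash-based duplicate detection, not pandas' actual implementation (which is vectorized in C), and `duplicated_sketch` is a hypothetical helper name:

```python
def duplicated_sketch(rows):
    """Hash-based duplicate detection: flag each row that was seen before.

    A simplified sketch of the idea behind pandas' duplicated(),
    not the real (vectorized) implementation.
    """
    seen = set()                       # hash table of rows seen so far
    flags = []
    for row in rows:
        key = tuple(row)               # rows must be hashable to act as keys
        flags.append(key in seen)      # O(1) average-time membership check
        seen.add(key)
    return flags

rows = [(1, 'x'), (2, 'y'), (2, 'y'), (3, 'z'), (4, 'x'), (4, 'x'), (4, 'x')]
print(duplicated_sketch(rows))
# [False, False, True, False, False, True, True]
```

One pass over the rows, one constant-time check per row: n rows means roughly n operations.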
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 operations |
| 100 | About 100 operations |
| 1000 | About 1,000 operations |
Pattern observation: Doubling the number of rows roughly doubles the work needed.
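You can check this pattern empirically with a rough timing sketch. Absolute times depend on your machine, so the point is the trend, not the numbers; the sizes chosen here are arbitrary:

```python
import time

import numpy as np
import pandas as pd

# Time duplicated() as n doubles; expect roughly doubling elapsed times
for n in [100_000, 200_000, 400_000]:
    df = pd.DataFrame({
        'A': np.random.randint(0, 1000, size=n),
        'B': np.random.randint(0, 1000, size=n),
    })
    start = time.perf_counter()
    df.duplicated()
    elapsed = time.perf_counter() - start
    print(f"n={n:>7,}: {elapsed:.4f}s")
```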
Time Complexity: O(n)
Space Complexity: O(n), since the hash table holds an entry for each distinct row seen.
This means the time to find duplicates grows linearly as the data gets bigger, thanks to efficient hashing.
[X] Wrong: "Finding duplicates requires checking each row against all previous rows, so O(n²)."
[OK] Correct: Pandas uses hashing (dictionary lookups) for O(1) average checks per row, making it O(n) total.
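For contrast, here is what the wrong O(n²) assumption would look like as code. This `duplicated_naive` helper is a hypothetical sketch that compares each row against every earlier row; it produces the same answers but does quadratically more work:

```python
def duplicated_naive(rows):
    """O(n^2) duplicate detection: compare each row to all earlier rows.

    A hypothetical sketch of the approach the wrong answer assumes;
    pandas does NOT work this way.
    """
    flags = []
    for i, row in enumerate(rows):
        # Linear scan over earlier rows: n(n-1)/2 comparisons in total
        flags.append(any(row == earlier for earlier in rows[:i]))
    return flags

rows = [(1, 'x'), (2, 'y'), (2, 'y'), (3, 'z'), (4, 'x'), (4, 'x'), (4, 'x')]
print(duplicated_naive(rows))
# [False, False, True, False, False, True, True]
```

Same output, very different scaling: doubling n roughly quadruples the comparisons here, while the hash-based approach only doubles its work.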
Understanding how duplicate detection scales helps you explain data processing limits and choose better methods when needed.
"What if we only checked duplicates in one column instead of all columns? How would the time complexity change?"
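As a starting point for that question: `duplicated()` accepts a `subset` parameter that restricts hashing to the named columns. The complexity class stays O(n), since there is still one hash and one lookup per row; only the constant per-row cost shrinks. Note that checking fewer columns can flag more rows, because rows that differ elsewhere may collide on the subset:

```python
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x'],
})

# Hash only column 'B' when checking for duplicates
only_b = data.duplicated(subset=['B'])
print(only_b.tolist())
# [False, False, True, False, True, True, True]
```

Row 4 (`4, 'x'`) is not a duplicate of the full row `(1, 'x')`, but its `'x'` in column B was already seen at row 0, so it is flagged when only column B is checked.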