Removing Duplicates (drop_duplicates) in Python Data Analysis - Time & Space Complexity
When we remove duplicates from data, we want to know how the time to do this changes as the data grows.
We ask: How much longer does it take if we have more rows?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# Build a small table with repeated rows
data = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x']
})

# Keep only the first occurrence of each distinct row
unique_data = data.drop_duplicates()
```
This code creates a table, then keeps only the first occurrence of each distinct row and drops the later repeats.
Identify the repeated work: loops, recursion, or array traversals.
- Primary operation: Hashing each row and checking it against the set of rows already seen.
- How many times: Each row is hashed and looked up exactly once during the pass.
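pandas' actual implementation is vectorized C code, but the idea behind it can be sketched in plain Python (this is a conceptual sketch, not the real internals): each row becomes a hashable tuple and is checked against a set exactly once, which is O(n) average time.

```python
def drop_duplicates_sketch(rows):
    """Keep the first occurrence of each row; one hash lookup per row."""
    seen = set()      # hash set: membership checks are O(1) on average
    unique = []
    for row in rows:  # single pass over all n rows
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

# Same data as the DataFrame above, as a list of (A, B) tuples
rows = list(zip([1, 2, 2, 3, 4, 4, 4], ['x', 'y', 'y', 'z', 'x', 'x', 'x']))
print(drop_duplicates_sketch(rows))
# → [(1, 'x'), (2, 'y'), (3, 'z'), (4, 'x')]
```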
As the number of rows grows, the work to find duplicates grows roughly in a straight line.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks |
| 100 | About 100 checks |
| 1000 | About 1000 checks |
Pattern observation: Doubling the rows roughly doubles the work.
Time Complexity: O(n)
This means the time to remove duplicates grows directly with the number of rows.
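You can verify the pattern in the table by counting the membership checks directly (using the same conceptual set-based sketch, not pandas internals):

```python
def count_checks(rows):
    """Count hash-set membership checks performed while deduplicating."""
    seen = set()
    checks = 0
    for row in rows:
        checks += 1        # exactly one check per row
        if row not in seen:
            seen.add(row)
    return checks

print(count_checks(range(100)))   # → 100
print(count_checks(range(200)))   # → 200: doubling the rows doubles the work
```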
[X] Wrong: "Removing duplicates takes much longer than just reading the data because it compares every row to every other row."
[OK] Correct: The method hashes rows rather than comparing every pair, so each row is checked against a hash set once; the work stays around O(n) on average instead of O(n^2).
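The difference between the wrong mental model and the hashed one is easy to quantify. These two functions are illustrative operation counts (assumed formulas, not measurements): the naive all-pairs approach does about n(n-1)/2 comparisons, while the hash-based approach does n lookups.

```python
def pairwise_checks(n):
    """Naive approach: compare every row with every other row."""
    return n * (n - 1) // 2

def hashed_checks(n):
    """Hash-based approach: one set lookup per row."""
    return n

for n in (10, 100, 1000):
    print(n, pairwise_checks(n), hashed_checks(n))
# At n = 1000 the naive approach does 499500 comparisons vs 1000 lookups.
```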
Understanding how removing duplicates scales helps you explain data cleaning steps clearly and shows you know how data size affects processing time.
"What if we remove duplicates based on only one column instead of all columns? How would the time complexity change?"