drop_duplicates() in pandas - Time & Space Complexity
When we remove duplicate rows with pandas, it is important to know how the running time grows as the data gets bigger - that is, how the amount of work changes as the number of rows increases.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

# Build a small table with repeated rows
df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z']
})

# Remove duplicate rows, keeping the first occurrence of each
unique_df = df.drop_duplicates()
```
This code creates a small table and removes duplicate rows, keeping only unique rows.
Identify the repeated work - the loops, recursion, or traversals involved:
- Primary operation: checking each row against previously seen rows to detect duplicates.
- How many times: each row is hashed and looked up once, in a single pass over the data.
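The single-pass hashing idea can be sketched in plain Python. This is a simplified illustration of the technique, not pandas' actual implementation - `drop_duplicate_rows` is a hypothetical helper:

```python
def drop_duplicate_rows(rows):
    seen = set()           # hash set: O(1) average lookup and insert
    unique = []
    for row in rows:       # single pass over all n rows
        key = tuple(row)   # hashable representation of the row
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [(1, 'x'), (2, 'y'), (2, 'y'), (3, 'z'), (3, 'z'), (3, 'z')]
print(drop_duplicate_rows(rows))  # [(1, 'x'), (2, 'y'), (3, 'z')]
```

Because each row triggers one hash computation and one set lookup, the total work is proportional to n.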
As the number of rows grows, the time to find duplicates grows roughly in a straight line.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks |
| 100 | About 100 checks |
| 1000 | About 1000 checks |
Pattern observation: Doubling the rows roughly doubles the work needed.
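You can check this pattern yourself with a small timing experiment. This is a rough sketch - exact timings vary by machine, but doubling the input should roughly double the elapsed time:

```python
import time
import pandas as pd

for n in (10_000, 20_000, 40_000):
    # All rows unique: the worst case, since every row must be kept
    df = pd.DataFrame({'A': range(n), 'B': range(n)})
    start = time.perf_counter()
    unique_df = df.drop_duplicates()
    elapsed = time.perf_counter() - start
    print(f"n={n}: {elapsed:.4f}s, {len(unique_df)} unique rows")
```

If the growth were quadratic, quadrupling n would make the last run roughly sixteen times slower than the first, which is not what you observe in practice.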
Time Complexity: O(n)
This means the time to remove duplicates grows in direct proportion to the number of rows. Space complexity is likewise O(n): the internal hash table and the result can each hold up to n rows.
[X] Wrong: "Removing duplicates takes time proportional to the square of the number of rows because every row is compared to every other row."
[OK] Correct: pandas uses efficient hashing internally, so each row is processed once rather than compared against every other row - the work is linear, not quadratic.
Understanding how data operations scale helps you write code that works well even with large datasets, a key skill in data science roles.
"What if we used drop_duplicates() on only one column instead of all columns? How would the time complexity change?"