Keeping first vs last vs none in Pandas - Performance Comparison
When working with data, we often remove duplicates. Choosing to keep the first, last, or no duplicates affects how long this takes.
We want to know how the time needed changes as the data grows.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z']
})

# Remove duplicates, keep first occurrence
result_first = df.drop_duplicates(keep='first')

# Remove duplicates, keep last occurrence
result_last = df.drop_duplicates(keep='last')

# Remove all duplicated rows entirely
result_none = df.drop_duplicates(keep=False)
```
This code removes duplicate rows from a DataFrame in three ways: keeping the first occurrence, the last occurrence, or removing all duplicates entirely.
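To make the three behaviors concrete, here is a quick sketch showing which row indices survive for the small DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z']
})

# keep='first': the first row of each duplicate group survives
print(df.drop_duplicates(keep='first').index.tolist())  # [0, 1, 3]

# keep='last': the last row of each duplicate group survives
print(df.drop_duplicates(keep='last').index.tolist())   # [0, 2, 5]

# keep=False: every row that has a duplicate is dropped;
# only the unique row (1, 'x') remains
print(df.drop_duplicates(keep=False).index.tolist())    # [0]
```

Note that `keep='first'` and `keep='last'` keep the same *number* of rows; they differ only in which representative of each duplicate group survives.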
Identify the repeated work: any loops, recursion, or array traversals.
- Primary operation: Scanning through all rows to find duplicates.
- How many times: Each row is checked once to find duplicates and decide which to keep.
As the number of rows grows, the time to check duplicates grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks |
| 100 | About 100 checks |
| 1000 | About 1000 checks |
Pattern observation: Doubling the data roughly doubles the work needed.
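You can check this pattern yourself with a small micro-benchmark. This is an illustrative sketch, not a rigorous benchmark: absolute timings depend on your machine, and the helper `time_drop` and its data shape are choices made here, not part of the original snippet.

```python
import time

import numpy as np
import pandas as pd

def time_drop(n: int, keep) -> float:
    """Time drop_duplicates on n rows with the given keep setting."""
    # Use half as many distinct values as rows so duplicates are guaranteed
    rng = np.random.default_rng(0)
    df = pd.DataFrame({'A': rng.integers(0, max(1, n // 2), size=n)})
    start = time.perf_counter()
    df.drop_duplicates(keep=keep)
    return time.perf_counter() - start

# Expect each column of timings to grow roughly in proportion to n,
# and the three keep settings to track each other closely.
for n in (10_000, 100_000, 1_000_000):
    print(n, [round(time_drop(n, k), 4) for k in ('first', 'last', False)])
```

Doubling `n` should roughly double each timing, consistent with a single O(n) scan regardless of the `keep` setting.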
Time Complexity: O(n)
This means the time to remove duplicates grows linearly with the number of rows in the data.
[X] Wrong: "Keeping first or last duplicates changes the time complexity significantly."
[OK] Correct: Both operations scan the data once, so they take about the same time as data grows.
Understanding how data operations scale helps you explain your choices clearly and shows you know how to handle bigger data smoothly.
"What if we used multiple columns to find duplicates instead of one? How would the time complexity change?"
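One way to explore that question: `drop_duplicates` accepts a `subset` parameter naming the columns to compare. The sketch below (with a `B` column tweaked from the original so the two subsets disagree) shows that adding columns changes *which* rows count as duplicates, while the operation remains a single hash-based pass, so it stays O(n) in the number of rows, with the per-row hashing cost growing with the number of subset columns.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'q', 'z']
})

# Duplicates judged on column 'A' alone: rows 1-2 and 3-5 collapse
print(df.drop_duplicates(subset=['A']).index.tolist())       # [0, 1, 3]

# Duplicates judged on both columns: row 4, (3, 'q'), is now unique
print(df.drop_duplicates(subset=['A', 'B']).index.tolist())  # [0, 1, 3, 4]
```

Each row is still visited once; the extra cost is hashing a tuple of several values instead of one, so the growth with row count stays linear.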