Counting Duplicates in pandas: Time & Space Complexity
We want to know how the time to count duplicates grows as the data gets bigger.
How does pandas find and count repeated rows or values efficiently?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z']
})
duplicates_count = df.duplicated().sum()
```
This code builds a small table and counts how many rows are duplicates of an earlier row (here, `duplicates_count` is 3). Identify the loops, recursion, or array traversals that do the repeated work.
- Primary operation: Checking each row against a hash table of previously seen rows to detect duplicates.
- How many times: Once for each row in the data.
As the number of rows grows, pandas checks each row once to see if it appeared before.
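The single-pass idea can be sketched in plain Python (a simplified illustration, not pandas' actual implementation, which hashes rows internally):

```python
# Remember rows already seen in a hash set; a row is a duplicate
# if it was seen before. One pass over the data = O(n) on average.
def count_duplicates(rows):
    seen = set()
    count = 0
    for row in rows:       # n iterations, one per row
        if row in seen:    # O(1) average-case set lookup
            count += 1
        else:
            seen.add(row)
    return count

rows = [(1, 'x'), (2, 'y'), (2, 'y'), (3, 'z'), (3, 'z'), (3, 'z')]
print(count_duplicates(rows))  # → 3, matching df.duplicated().sum()
```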
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks |
| 100 | About 100 checks |
| 1000 | About 1000 checks |
Pattern observation: The work grows directly with the number of rows.
Time Complexity: O(n) on average (hash lookups are O(1) in the average case). This means the time to count duplicates grows linearly with the number of rows.
Space Complexity: O(n) as well, since the set of seen rows and the boolean mask returned by `duplicated()` both grow with the number of rows.
[X] Wrong: "Counting duplicates takes much longer than just looking at each row once."
[OK] Correct: pandas hashes each row and tracks seen rows in a hash table, so each row is checked once; it never compares all pairs of rows.
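For contrast, here is what the "wrong" mental model would look like in code: a naive all-pairs check that compares each row to every earlier row, doing roughly n*(n-1)/2 comparisons, i.e. O(n²):

```python
# Naive O(n^2) duplicate count: compare each row to all earlier rows.
# Gives the same answer as the hash-based approach, but scales far worse.
def count_duplicates_naive(rows):
    count = 0
    for i in range(len(rows)):
        for j in range(i):            # compare against every earlier row
            if rows[i] == rows[j]:
                count += 1
                break                 # count each duplicate row only once
    return count

rows = [(1, 'x'), (2, 'y'), (2, 'y'), (3, 'z'), (3, 'z'), (3, 'z')]
print(count_duplicates_naive(rows))  # → 3, same result, ~n^2/2 comparisons
```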
Understanding how counting duplicates scales helps you explain data cleaning steps clearly and shows you know how pandas handles data efficiently.
"What if we counted duplicates based on only one column instead of all columns? How would the time complexity change?"