
duplicated() for finding duplicates in Pandas - Time & Space Complexity

Time Complexity: duplicated() for finding duplicates
O(n)
Understanding Time Complexity

We want to understand how the time needed to find duplicates grows as the data gets bigger.

How does pandas check each row to see if it appeared before?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x']
})

duplicates = data.duplicated()
print(duplicates)

This code creates a small table and finds which rows are duplicates of earlier rows.

Identify Repeating Operations

Identify the loops, recursion, or array traversals that repeat.

  • Primary operation: Hashing each row and checking whether it was seen before in a hash table.
  • How many times: Once per row, with each check taking O(1) time on average.
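The one-pass, hash-based idea above can be sketched directly. This is a simplified illustration of the technique, not pandas' actual internals (which use optimized hash tables in C), with a hypothetical helper name `mark_duplicates`:

```python
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x']
})

def mark_duplicates(df):
    """Flag each row that repeats an earlier row, using a set as the hash table."""
    seen = set()
    flags = []
    for row in df.itertuples(index=False):  # one pass over n rows
        if row in seen:                     # O(1) average lookup
            flags.append(True)
        else:
            seen.add(row)                   # O(1) average insert
            flags.append(False)
    return pd.Series(flags)

print(mark_duplicates(data).tolist())
# [False, False, True, False, False, True, True]
```

One pass with constant-time membership checks is what makes the whole procedure O(n).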
How Execution Grows With Input

As the number of rows grows, the work to find duplicates grows in proportion to the number of rows.

Input Size (n)    Approx. Operations
10                About 10 operations
100               About 100 operations
1,000             About 1,000 operations

Pattern observation: Doubling the number of rows roughly doubles the work needed.

Final Time Complexity

Time Complexity: O(n)

This means the time to find duplicates grows linearly as the data gets bigger, thanks to efficient hashing.

Common Mistake

[X] Wrong: "Finding duplicates requires checking each row against all previous rows, so O(n²)."

[OK] Correct: Pandas uses hashing (dictionary lookups) for O(1) average checks per row, making it O(n) total.
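To see why the O(n²) intuition is a mistake, here is what the pairwise approach would look like. This `naive_duplicated` helper is hypothetical, written only to contrast with the hash-based method:

```python
import pandas as pd

data = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x']
})

def naive_duplicated(df):
    """Quadratic approach: compare each row against every earlier row."""
    rows = [tuple(r) for r in df.itertuples(index=False)]
    flags = []
    for i, row in enumerate(rows):
        # Row i is compared against up to i earlier rows: ~n^2/2 comparisons total.
        flags.append(any(row == earlier for earlier in rows[:i]))
    return pd.Series(flags)

# Same answer as data.duplicated(), but quadratically more work as n grows.
print(naive_duplicated(data).tolist())
```

Both versions produce identical flags on small data; the difference only shows up in how the running time scales.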

Interview Connect

Understanding how duplicate detection scales helps you explain data processing limits and choose better methods when needed.

Self-Check

"What if we only checked duplicates in one column instead of all columns? How would the time complexity change?"