
Duplicates on specific columns in Pandas - Time & Space Complexity

Time Complexity: Duplicates on specific columns
O(n)
Understanding Time Complexity

We want to know how the time needed to find duplicates on certain columns changes as the data grows.

How does the work increase when we have more rows in the table?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x'],
    'C': [10, 20, 20, 30, 40, 40, 40]
})

duplicates = df.duplicated(subset=['A', 'B'])

This code checks which rows have duplicate values in columns 'A' and 'B'.
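Running the check on the DataFrame above, the result is a boolean Series that flags later occurrences of each ('A', 'B') pair. The `keep` parameter controls which occurrences are marked; `keep='first'` is the default.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x'],
    'C': [10, 20, 20, 30, 40, 40, 40]
})

# keep='first' (the default) marks every occurrence after the first
mask = df.duplicated(subset=['A', 'B'])
print(mask.tolist())      # [False, False, True, False, False, True, True]

# keep=False marks *all* rows belonging to a duplicated ('A', 'B') pair
mask_all = df.duplicated(subset=['A', 'B'], keep=False)
print(mask_all.tolist())  # [False, True, True, False, True, True, True]
```

`keep=False` is handy when you want to inspect every member of a duplicate group rather than only the repeats.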

Identify Repeating Operations

Identify the loops, recursion, or array traversals that repeat.

  • Primary operation: For each row, hashing values in columns 'A' and 'B' to check if seen before.
  • How many times: Once per row.
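The per-row hashing idea can be sketched in pure Python (this is only an illustrative sketch, not pandas' actual implementation, which is vectorized in C):

```python
def duplicated_subset(rows, cols):
    """Mark a row True if its values in `cols` appeared in an earlier row."""
    seen = set()
    flags = []
    for row in rows:                          # one pass over the rows: O(n)
        key = tuple(row[c] for c in cols)     # hash only the chosen columns
        flags.append(key in seen)             # O(1) average set lookup
        seen.add(key)
    return flags

rows = [
    {'A': 1, 'B': 'x'}, {'A': 2, 'B': 'y'}, {'A': 2, 'B': 'y'},
    {'A': 3, 'B': 'z'}, {'A': 4, 'B': 'x'}, {'A': 4, 'B': 'x'},
    {'A': 4, 'B': 'x'},
]
print(duplicated_subset(rows, ['A', 'B']))
# → [False, False, True, False, False, True, True]
```

Each row is visited exactly once, and each visit does a constant-time (on average) set lookup, which is where the O(n) total comes from.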
How Execution Grows With Input

As the number of rows grows, the work to check duplicates grows roughly in proportion to the number of rows.

Input Size (n)   Approx. Operations
10               About 10 checks
100              About 100 checks
1000             About 1000 checks

Pattern observation: The number of operations grows roughly in a straight line as rows increase.
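One way to see this pattern empirically is to time the check as the row count doubles; absolute timings vary by machine, so this is only a rough sanity check, not a benchmark:

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
for n in (100_000, 200_000, 400_000):
    df = pd.DataFrame({
        'A': rng.integers(0, 1_000, size=n),
        'B': rng.integers(0, 1_000, size=n),
    })
    start = time.perf_counter()
    df.duplicated(subset=['A', 'B'])
    elapsed = time.perf_counter() - start
    print(f"n={n:>7}: {elapsed:.4f}s")
```

If the operation is linear, doubling n should roughly double the elapsed time.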

Final Time Complexity

Time Complexity: O(n)

This means the time to find duplicates grows directly with the number of rows.

Common Mistake

[X] Wrong: "Finding duplicates on specific columns takes much longer than checking all columns."

[OK] Correct: The operation is dominated by the number of rows; the number of columns you check only changes the constant cost of hashing each row, so checking fewer columns does not drastically change the time.
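To illustrate the point above: checking all columns is still a single pass over the rows; only the number of values hashed per row changes. In the example DataFrame, 'C' happens to mirror the ('A', 'B') pairs, so both checks even agree.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['x', 'y', 'y', 'z', 'x', 'x', 'x'],
    'C': [10, 20, 20, 30, 40, 40, 40]
})

# Subset check: hashes 2 values per row, one pass over n rows
subset_dupes = df.duplicated(subset=['A', 'B'])

# Full-row check: hashes 3 values per row, still one pass over n rows
all_dupes = df.duplicated()

print(subset_dupes.equals(all_dupes))   # True for this particular data
```

Both calls are O(n) in the number of rows; the column count only scales the per-row constant.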

Interview Connect

Understanding how duplicate checks scale helps you explain data cleaning steps clearly and shows you know how data size affects performance.

Self-Check

"What if we checked duplicates on all columns instead of specific ones? How would the time complexity change?"