
drop_duplicates() for removal in Pandas - Time & Space Complexity

Time Complexity: drop_duplicates() for removal
O(n)
Understanding Time Complexity

When we remove duplicate rows with pandas, it is important to know how the time taken grows as the data gets bigger.

In other words, we want to understand how the amount of work changes as the number of rows increases.

Scenario Under Consideration

Analyze the time complexity of the following code snippet.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z']
})

unique_df = df.drop_duplicates()

This code creates a small table and removes duplicate rows, keeping only unique rows.
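Running the snippet makes the behavior concrete: by default, drop_duplicates() keeps the first occurrence of each duplicated row (the keep='first' default) and preserves the original index labels.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z']
})

# keep='first' is the default: the first copy of each row survives.
unique_df = df.drop_duplicates()
print(unique_df)
#    A  B
# 0  1  x
# 1  2  y
# 3  3  z
```

Note that the surviving rows keep their original index labels (0, 1, 3); chain .reset_index(drop=True) if you want a fresh 0-based index.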

Identify Repeating Operations

Identify the loops, recursion, or array traversals that repeat.

  • Primary operation: Checking each row against others to find duplicates.
  • How many times: Each row is hashed and checked once during the process.
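The steps above can be sketched in plain Python: each row is turned into a hashable tuple and checked against a set of rows already seen, so there is one hash lookup per row. This is a simplified model of a hash-based pass, not pandas' actual implementation.

```python
def dedup_rows(rows):
    """Keep the first occurrence of each row, using a set for O(1) lookups."""
    seen = set()          # hashes of rows encountered so far
    unique = []
    for row in rows:
        key = tuple(row)  # tuples are hashable; lists are not
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

rows = [[1, 'x'], [2, 'y'], [2, 'y'], [3, 'z'], [3, 'z'], [3, 'z']]
print(dedup_rows(rows))  # [[1, 'x'], [2, 'y'], [3, 'z']]
```
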
How Execution Grows With Input

As the number of rows grows, the time to find duplicates grows roughly in a straight line.

Input Size (n) | Approx. Operations
-------------- | -------------------
10             | About 10 checks
100            | About 100 checks
1000           | About 1000 checks

Pattern observation: Doubling the rows roughly doubles the work needed.
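You can check the linear pattern empirically by timing drop_duplicates() on growing inputs. Exact timings are machine-dependent; the point is that each tenfold increase in rows makes the call take roughly ten times longer.

```python
import time

import numpy as np
import pandas as pd

for n in [10_000, 100_000, 1_000_000]:
    # Values in a small range guarantee plenty of duplicate rows.
    df = pd.DataFrame({'A': np.random.randint(0, 100, n),
                       'B': np.random.randint(0, 100, n)})
    start = time.perf_counter()
    df.drop_duplicates()
    elapsed = time.perf_counter() - start
    print(f"n={n:>9,}  time={elapsed:.4f}s")
```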

Final Time Complexity

Time Complexity: O(n)

This means the time to remove duplicates grows directly in proportion to the number of rows.

Common Mistake

[X] Wrong: "Removing duplicates takes time proportional to the square of the number of rows because every row is compared to every other row."

[OK] Correct: pandas uses efficient hashing and indexing internally, so it does not compare every row to every other row directly.
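The difference is easy to see side by side: a naive pairwise approach does on the order of n² row comparisons, while a set-based pass does one hash lookup per row. Both toy versions below produce the same result; only the cost differs (again a simplified model for illustration, not pandas' internals).

```python
def dedup_quadratic(rows):
    """O(n^2): compare each row against every row already kept."""
    unique = []
    for row in rows:
        if not any(row == kept for kept in unique):
            unique.append(row)
    return unique

def dedup_linear(rows):
    """O(n): one hash lookup per row, as a hash-based approach does."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique

rows = [(1, 'x'), (2, 'y'), (2, 'y'), (3, 'z'), (3, 'z'), (3, 'z')]
assert dedup_quadratic(rows) == dedup_linear(rows)  # same answer, different cost
```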

Interview Connect

Understanding how data operations scale helps you write code that works well even with large datasets, a key skill in data science roles.

Self-Check

"What if we used drop_duplicates() on only one column instead of all columns? How would the time complexity change?"