Keeping first vs last vs none in Pandas - Performance Comparison
When working with data, we often remove duplicates. Choosing to keep the first, last, or no duplicates affects how long this takes.
We want to know how the time needed changes as the data grows.
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z']
})

# Remove duplicates, keep first occurrence
result_first = df.drop_duplicates(keep='first')

# Remove duplicates, keep last occurrence
result_last = df.drop_duplicates(keep='last')

# Remove all duplicated rows entirely
result_none = df.drop_duplicates(keep=False)
```
This code removes duplicate rows from a DataFrame in three ways: keeping the first occurrence, the last occurrence, or removing all duplicates entirely.
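To make the three behaviors concrete, here is a quick sketch showing which row indices survive for the small DataFrame above:

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z']
})

# keep='first': the first row of each duplicate group survives
print(df.drop_duplicates(keep='first').index.tolist())  # [0, 1, 3]

# keep='last': the last row of each duplicate group survives
print(df.drop_duplicates(keep='last').index.tolist())   # [0, 2, 5]

# keep=False: every row that has a duplicate is dropped;
# only the unique row (1, 'x') remains
print(df.drop_duplicates(keep=False).index.tolist())    # [0]
```

Note that `keep='first'` and `keep='last'` keep the same *number* of rows; they differ only in which representative of each duplicate group survives.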
Identify the repeated work: any loops, recursion, or array traversals.
- Primary operation: Scanning through all rows to find duplicates.
- How many times: Each row is checked once to find duplicates and decide which to keep.
As the number of rows grows, the time to check duplicates grows roughly in direct proportion.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks |
| 100 | About 100 checks |
| 1000 | About 1000 checks |
Pattern observation: Doubling the data roughly doubles the work needed.
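You can check this pattern yourself with a small micro-benchmark. This is an illustrative sketch, not a rigorous benchmark: absolute timings depend on your machine, and the helper `time_drop` and its data shape are choices made here, not part of the original snippet.

```python
import time

import numpy as np
import pandas as pd

def time_drop(n: int, keep) -> float:
    """Time drop_duplicates on n rows with the given keep setting."""
    # Use half as many distinct values as rows so duplicates are guaranteed
    rng = np.random.default_rng(0)
    df = pd.DataFrame({'A': rng.integers(0, max(1, n // 2), size=n)})
    start = time.perf_counter()
    df.drop_duplicates(keep=keep)
    return time.perf_counter() - start

# Expect each column of timings to grow roughly in proportion to n,
# and the three keep settings to track each other closely.
for n in (10_000, 100_000, 1_000_000):
    print(n, [round(time_drop(n, k), 4) for k in ('first', 'last', False)])
```

Doubling `n` should roughly double each timing, consistent with a single O(n) scan regardless of the `keep` setting.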
Time Complexity: O(n)
This means the time to remove duplicates grows linearly with the number of rows in the data.
[X] Wrong: "Keeping first or last duplicates changes the time complexity significantly."
[OK] Correct: Both operations scan the data once, so they take about the same time as data grows.
Understanding how data operations scale helps you explain your choices clearly and shows you know how to handle bigger data smoothly.
"What if we used multiple columns to find duplicates instead of one? How would the time complexity change?"
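One way to explore that question: `drop_duplicates` accepts a `subset` parameter naming the columns to compare. The sketch below (with a `B` column tweaked from the original so the two subsets disagree) shows that adding columns changes *which* rows count as duplicates, while the operation remains a single hash-based pass, so it stays O(n) in the number of rows, with the per-row hashing cost growing with the number of subset columns.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'q', 'z']
})

# Duplicates judged on column 'A' alone: rows 1-2 and 3-5 collapse
print(df.drop_duplicates(subset=['A']).index.tolist())       # [0, 1, 3]

# Duplicates judged on both columns: row 4, (3, 'q'), is now unique
print(df.drop_duplicates(subset=['A', 'B']).index.tolist())  # [0, 1, 3, 4]
```

Each row is still visited once; the extra cost is hashing a tuple of several values instead of one, so the growth with row count stays linear.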