Handling duplicate column names in Data Analysis Python - Time & Space Complexity

Time Complexity: Handling duplicate column names
O(n)
Understanding Time Complexity

When working with data tables, sometimes columns have the same name. We want to understand how the time to fix these duplicates changes as the table grows.

How does the work needed to find and rename duplicates grow with more columns?

Scenario Under Consideration

Analyze the time complexity of the following code snippet.


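import pandas as pd

def rename_duplicates(df):
    # Copy the column labels into a plain list so they can be edited by position.
    cols = df.columns.tolist()
    # Maps each name to the number of repeats of it seen so far.
    seen = {}
    for i, col in enumerate(cols):
        if col in seen:
            # Repeat occurrence: append the running count as a suffix.
            seen[col] += 1
            cols[i] = f"{col}_{seen[col]}"
        else:
            # First occurrence: keep the name unchanged.
            seen[col] = 0
    # Assign the new labels back; this modifies df in place.
    df.columns = cols
    return df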

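This code renames duplicate column names by appending a numeric suffix: the first occurrence keeps its name, and later occurrences become name_1, name_2, and so on. It modifies df in place, and it does not check whether a generated name such as score_1 already exists as a column, so the result is unique only when no such collision occurs.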
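As a quick illustration, the function can be run on a small table. The sketch below repeats the function so that it runs on its own; the sample column names and values are made up for the example. The second call shows what happens when a generated name already exists as a column.

```python
import pandas as pd

def rename_duplicates(df):
    cols = df.columns.tolist()
    seen = {}
    for i, col in enumerate(cols):
        if col in seen:
            seen[col] += 1
            cols[i] = f"{col}_{seen[col]}"
        else:
            seen[col] = 0
    df.columns = cols
    return df

# Hypothetical sample data with two pairs of duplicate names.
df = pd.DataFrame([[1, 2, 3, 4]], columns=["id", "score", "score", "id"])
result_cols = rename_duplicates(df).columns.tolist()
print(result_cols)  # ['id', 'score', 'score_1', 'id_1']

# Hypothetical edge case: the generated name "a_1" is already a column.
clash = pd.DataFrame([[1, 2, 3]], columns=["a", "a", "a_1"])
clash_cols = rename_duplicates(clash).columns.tolist()
print(clash_cols)  # ['a', 'a_1', 'a_1']
```

The second result still contains a duplicate, so a table that may already hold suffixed names would need an extra uniqueness check.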

Identify Repeating Operations

Identify the loops, recursion, and array traversals that repeat.

  • Primary operation: A single loop over all column names.
  • How many times: Once for each column in the list.
How Execution Grows With Input

As the number of columns grows, the loop runs once per column, doing one dictionary lookup and one dictionary update each time.

Input Size (n) | Approx. Operations
10             | About 10 checks and updates
100            | About 100 checks and updates
1000           | About 1000 checks and updates

Pattern observation: The work grows directly with the number of columns. Double the columns, double the work.
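The pattern can be checked by counting loop iterations directly. This is a minimal sketch that runs the same renaming loop on a plain list instead of a DataFrame; the helper name count_loop_steps and the test input are hypothetical, chosen only for this illustration.

```python
def count_loop_steps(cols):
    """Run the renaming loop on a list of names and return the iteration count."""
    seen = {}
    steps = 0
    renamed = list(cols)  # work on a copy so the input is left untouched
    for i, col in enumerate(renamed):
        steps += 1  # one check-and-update per column
        if col in seen:
            seen[col] += 1
            renamed[i] = f"{col}_{seen[col]}"
        else:
            seen[col] = 0
    return steps

for n in (10, 100, 1000):
    # Hypothetical input: every column shares the name "x".
    print(n, count_loop_steps(["x"] * n))
```

The step count equals the number of columns for each size, whether or not the names repeat.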

Final Time Complexity

Time Complexity: O(n)

This means the time to rename duplicates grows in direct proportion to the number of columns.

Common Mistake

[X] Wrong: "Checking for duplicates inside the loop makes this code slower than linear time."

[OK] Correct: A dictionary lookup takes constant time on average and happens once per column, so the total work still grows linearly.

Interview Connect

Understanding how to handle duplicates efficiently shows you can work with real messy data and keep your code running smoothly as data grows.

Self-Check

"What if we used a nested loop to compare each column with every other column? How would the time complexity change?"
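One way to explore this question is to count the comparisons a nested-loop version would make. The sketch below is an assumed variant written for this purpose, not code from the scenario above; the function name count_pairwise_comparisons is hypothetical.

```python
def count_pairwise_comparisons(cols):
    """Compare every column with every later column.

    Returns the number of comparisons made and the sorted positions
    of columns that repeat an earlier name.
    """
    comparisons = 0
    duplicate_positions = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            comparisons += 1
            if cols[i] == cols[j]:
                duplicate_positions.add(j)
    return comparisons, sorted(duplicate_positions)

print(count_pairwise_comparisons(["id", "score", "score", "id"]))  # (6, [2, 3])

for n in (10, 100, 1000):
    # Hypothetical input: n distinct names, so no comparison finds a match.
    names = [f"col{k}" for k in range(n)]
    print(n, count_pairwise_comparisons(names)[0])
```

For n columns this makes n(n-1)/2 comparisons, so multiplying the column count by ten multiplies the work by roughly one hundred.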