Handling duplicate column names in Python data analysis - Time & Space Complexity
When working with data tables, sometimes columns have the same name. We want to understand how the time to fix these duplicates changes as the table grows.
How does the work needed to find and rename duplicates grow with more columns?
Analyze the time complexity of the following code snippet.
```python
import pandas as pd

def rename_duplicates(df):
    cols = df.columns.tolist()
    seen = {}
    for i, col in enumerate(cols):
        if col in seen:
            # Duplicate: bump its counter and append a numeric suffix.
            seen[col] += 1
            cols[i] = f"{col}_{seen[col]}"
        else:
            # First occurrence: record it with a count of 0.
            seen[col] = 0
    df.columns = cols
    return df
```
This code renames duplicate column names by adding a number suffix to make each name unique.
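To see the renaming in action, here is a minimal, self-contained run (the function is reproduced from the snippet above; the sample column names are made up for illustration):

```python
import pandas as pd

def rename_duplicates(df):
    cols = df.columns.tolist()
    seen = {}
    for i, col in enumerate(cols):
        if col in seen:
            seen[col] += 1
            cols[i] = f"{col}_{seen[col]}"
        else:
            seen[col] = 0
    df.columns = cols
    return df

# Hypothetical sample: "price" appears twice.
df = pd.DataFrame([[1, 2, 3]], columns=["id", "price", "price"])
print(list(rename_duplicates(df).columns))  # → ['id', 'price', 'price_1']
```

The first occurrence of each name is kept as-is; only later repeats get a suffix.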
Identify the loops, recursion, or array traversals that cause repeated work.
- Primary operation: A single loop over all column names.
- How many times: Once for each column in the list.
As the number of columns grows, the loop runs once per column, doing a quick check and update each time.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | About 10 checks and updates |
| 100 | About 100 checks and updates |
| 1000 | About 1000 checks and updates |
Pattern observation: The work grows directly with the number of columns. Double the columns, double the work.
Time Complexity: O(n)
This means the time to rename duplicates grows in a straight line with the number of columns. Space complexity is also O(n): the `cols` list and the `seen` dictionary each hold at most one entry per column.
[X] Wrong: "Checking for duplicates inside the loop makes this code slower than linear time."
[OK] Correct: The dictionary membership check (`col in seen`) is an average O(1) hash lookup performed once per column, so the total work still grows linearly.
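To make that concrete, here is a small sketch (illustrative names, not from the snippet above) contrasting the O(1)-average set/dict check with a list scan, which would push the whole loop to O(n^2):

```python
# Average O(1) per check: hash-based membership, total O(n).
def count_dupes_fast(names):
    seen = set()
    dupes = 0
    for name in names:
        if name in seen:   # constant-time hash lookup on average
            dupes += 1
        seen.add(name)
    return dupes

# O(n) per check: linear scan of a list, total O(n^2).
def count_dupes_slow(names):
    seen = []
    dupes = 0
    for name in names:
        if name in seen:   # rescans the whole list each time
            dupes += 1
        seen.append(name)
    return dupes

names = ["a", "b", "a", "c", "b", "a"]
print(count_dupes_fast(names), count_dupes_slow(names))  # → 3 3
```

Both versions give the same answer; only the cost per membership check differs.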
Understanding how to handle duplicates efficiently shows you can work with real messy data and keep your code running smoothly as data grows.
"What if we used a nested loop to compare each column with every other column? How would the time complexity change?"
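As a sketch of that nested-loop alternative (a hypothetical rewrite, not the snippet above): for each column we rescan every earlier column, so the total work is 1 + 2 + ... + (n-1), roughly n^2/2, i.e. O(n^2):

```python
import pandas as pd

# Hypothetical nested-loop rewrite: for each column, rescan all
# earlier columns to count prior occurrences of the same name.
def rename_duplicates_quadratic(df):
    cols = df.columns.tolist()
    new_cols = []
    for i, col in enumerate(cols):
        count = sum(1 for earlier in cols[:i] if earlier == col)  # O(i) scan
        new_cols.append(f"{col}_{count}" if count else col)
    df.columns = new_cols
    return df

df = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "a", "b", "a"])
print(list(rename_duplicates_quadratic(df).columns))  # → ['a', 'a_1', 'b', 'a_2']
```

The output matches the dictionary version, but doubling the columns now roughly quadruples the work.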