
Removing duplicate rows with drop_duplicates() in pandas

Introduction

drop_duplicates() removes repeated rows from a DataFrame. This keeps the data clean and prevents the same record from being counted more than once. Typical situations:

When you have a list of customer orders and want to see each order only once.
When you collect survey answers and want to remove repeated responses.
When you combine data from different sources and want to avoid counting the same item twice.
When you want to prepare data for analysis and need unique records.
When cleaning data before making charts or reports.
Syntax
Pandas
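DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

subset selects the column or columns to compare when looking for duplicates. If None, all columns are compared.

keep decides which row of each group of duplicates survives: 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False drops every row that has a duplicate, so no copy is kept.

inplace=True modifies the DataFrame directly instead of returning a new one.

ignore_index=True relabels the rows of the result 0, 1, 2, ... instead of keeping the original index labels.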
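The three keep settings are easiest to compare side by side. The following is a minimal sketch on made-up data (the DataFrame and variable names are illustrative, not part of the original lesson), where rows 0 and 2 are identical:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are exact copies of each other.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"],
                   "Age": [25, 30, 25]})

first = df.drop_duplicates(keep="first")       # drops row 2
last = df.drop_duplicates(keep="last")         # drops row 0
unique_only = df.drop_duplicates(keep=False)   # drops rows 0 and 2

print(first.index.tolist())        # [0, 1]
print(last.index.tolist())         # [1, 2]
print(unique_only.index.tolist())  # [1]
```

Note that keep=False leaves only the rows that never had a duplicate, which is stricter than keeping one copy of each.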

Examples
Remove duplicate rows, comparing all columns and keeping the first occurrence.
Pandas
df.drop_duplicates()
Remove duplicates based only on 'Name' and 'Age' columns.
Pandas
df.drop_duplicates(subset=['Name', 'Age'])
Keep the last occurrence of each duplicate row.
Pandas
df.drop_duplicates(keep='last')
Remove duplicates and change the original DataFrame directly.
Pandas
df.drop_duplicates(inplace=True)
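The calls above assume an existing DataFrame named df. As a small runnable sketch of how subset changes what counts as a duplicate, the made-up data below has two rows that share Name and Age but differ in City:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 match on Name and Age but not on City.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"],
                   "Age": [25, 30, 25],
                   "City": ["NY", "LA", "Boston"]})

# No row repeats across all three columns, so nothing is dropped.
all_cols = df.drop_duplicates()

# Comparing only Name and Age treats row 2 as a duplicate of row 0.
by_name_age = df.drop_duplicates(subset=["Name", "Age"])

print(len(all_cols))                 # 3
print(by_name_age["City"].tolist())  # ['NY', 'LA']
```

The City column is still present in by_name_age; subset only controls which columns are compared, not which columns are returned.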
Sample Program

This code creates a small table with repeated rows. Then it removes duplicates and shows the cleaned table.

Pandas
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}

df = pd.DataFrame(data)

print('Original DataFrame:')
print(df)

# Remove duplicate rows keeping the first occurrence
unique_df = df.drop_duplicates()

print('\nDataFrame after drop_duplicates():')
print(unique_df)
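Output
Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
4    Bob   30       LA

DataFrame after drop_duplicates():
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago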
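When the sample program runs, the surviving rows keep their original index labels (0, 1 and 3) rather than being renumbered. If consecutive labels are wanted, one option is the ignore_index parameter, sketched here on the same data and assuming pandas 1.0 or later; on older versions, calling reset_index(drop=True) on the result should have the same effect.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice", "David", "Bob"],
                   "Age": [25, 30, 25, 40, 30],
                   "City": ["NY", "LA", "NY", "Chicago", "LA"]})

# Default behaviour: the original row labels are preserved.
kept = df.drop_duplicates()
print(kept.index.tolist())        # [0, 1, 3]

# ignore_index=True relabels the remaining rows from 0.
renumbered = df.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```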
Important Notes

Using inplace=True changes the original DataFrame directly. In that case the method returns None, so do not assign its result back to a variable.
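A common slip is writing df = df.drop_duplicates(inplace=True), which replaces the DataFrame with None. A minimal sketch with made-up data shows what the call actually returns:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are exact copies of each other.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"],
                   "Age": [25, 30, 25]})

# With inplace=True the DataFrame is modified and nothing is returned.
result = df.drop_duplicates(inplace=True)

print(result)   # None
print(len(df))  # 2
```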

To decide duplicates by comparing only some columns, pass those columns to the subset parameter. The result still contains every column of the DataFrame.

Remember that drop_duplicates() keeps the first occurrence by default.

Summary

drop_duplicates() removes repeated rows from data.

You can choose which columns to check and which duplicates to keep.

It helps clean data before analysis or reporting.