What is drop_duplicates() for removal in Pandas?




Introduction

We use drop_duplicates() to remove repeated rows from a DataFrame. This keeps data clean and easier to understand. It is useful in situations like these:

When you have a list of customer orders and want to see each order only once.

When you collect survey answers and want to remove repeated responses.

When you combine data from different sources and want to avoid counting the same item twice.

When you want to prepare data for analysis and need unique records.

When cleaning data before making charts or reports.

Syntax

Pandas

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

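subset lets you choose which columns to check for duplicates. Pass a single column label or a list of labels. If None, all columns are checked.

keep decides which row of each duplicate group survives: 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False drops every row that has a duplicate, including the first occurrence.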
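The three keep options can be compared side by side. The sketch below uses a small made-up DataFrame (the names and ages are illustrative only) in which rows 0 and 2 are identical, and prints which index labels survive in each case.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"],
                   "Age": [25, 30, 25]})

first = df.drop_duplicates(keep="first")  # drops row 2, the later copy
last = df.drop_duplicates(keep="last")    # drops row 0, the earlier copy
none = df.drop_duplicates(keep=False)     # drops both copies

print(first.index.tolist())  # [0, 1]
print(last.index.tolist())   # [1, 2]
print(none.index.tolist())   # [1]
```

Note that keep=False is the only option that can leave you with no copy of a repeated row at all.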

Examples

Remove duplicate rows considering all columns, keep the first occurrence.

Pandas

df.drop_duplicates()

Remove duplicates based only on 'Name' and 'Age' columns.

Pandas

df.drop_duplicates(subset=['Name', 'Age'])
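To see what subset changes, here is a minimal sketch with made-up data in which two rows share the same Name and Age but differ in City. Checking all columns finds no duplicates, while checking only Name and Age removes the second row.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 1 match on Name and Age, not on City.
df = pd.DataFrame({"Name": ["Alice", "Alice", "Bob"],
                   "Age": [25, 25, 30],
                   "City": ["NY", "Boston", "LA"]})

all_cols = df.drop_duplicates()                           # nothing removed
by_name_age = df.drop_duplicates(subset=["Name", "Age"])  # row 1 removed

print(len(all_cols))     # 3
print(len(by_name_age))  # 2
print(by_name_age)       # still has all three columns
```

The subset columns only decide which rows count as duplicates. The result still contains every column, with the values from the row that was kept.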

Keep the last occurrence of each duplicate row.

Pandas

df.drop_duplicates(keep='last')

Remove duplicates and change the original DataFrame directly.

Pandas

df.drop_duplicates(inplace=True)
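One thing to watch with inplace=True is the return value. The short sketch below, using made-up data, shows that the call returns None and changes df itself.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice"],
                   "Age": [25, 30, 25]})

result = df.drop_duplicates(inplace=True)

print(result)   # None
print(len(df))  # 2
```

Because of this, writing df = df.drop_duplicates(inplace=True) would replace your DataFrame with None. Either use inplace=True on its own, or assign the result of a plain df.drop_duplicates() call.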

Sample Program

This code creates a small table with repeated rows. Then it removes duplicates and shows the cleaned table.

Pandas

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}

df = pd.DataFrame(data)

print('Original DataFrame:')
print(df)

# Remove duplicate rows keeping the first occurrence
unique_df = df.drop_duplicates()

print('\nDataFrame after drop_duplicates():')
print(unique_df)

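Output: the cleaned table keeps rows 0, 1, and 3. Rows 2 and 4 are dropped because they repeat rows 0 and 1, and the remaining rows keep their original index labels.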

Important Notes

Using inplace=True modifies the original DataFrame and returns None, so do not assign the result back to a variable.

To detect duplicates using only some columns while still keeping every column in the result, use the subset parameter.

Remember that drop_duplicates() keeps the first occurrence by default.
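One more detail worth knowing: the rows that remain keep their original index labels, so the index can have gaps. The sketch below, using made-up data, shows this and how the ignore_index parameter (available in pandas 1.0 and later) renumbers the result from 0.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical.
df = pd.DataFrame({"Name": ["Alice", "Bob", "Alice", "David"],
                   "Age": [25, 30, 25, 40]})

kept = df.drop_duplicates()                         # original labels kept
renumbered = df.drop_duplicates(ignore_index=True)  # labels reset to 0..n-1

print(kept.index.tolist())        # [0, 1, 3]
print(renumbered.index.tolist())  # [0, 1, 2]
```

Calling reset_index(drop=True) on the result is another way to get the same renumbering.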

Summary

drop_duplicates() removes repeated rows from data.

You can choose which columns to check and which duplicates to keep.

It helps clean data before analysis or reporting.