Removing duplicates (drop_duplicates) in Data Analysis Python

Introduction

Sometimes data has repeated rows that we don't need. Removing duplicates helps us clean data and get accurate results.

You have a list of customer orders and want to count each customer only once.

You collected survey responses but some people submitted multiple times.

You merged two datasets and want to remove repeated rows.

You want to prepare data for analysis without repeated entries.

Syntax

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

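subset lets you choose which columns to check for duplicates. If None, all columns are checked.

keep decides which duplicate to keep: 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False drops every row that has a duplicate, including the first one.

inplace controls whether the original DataFrame is modified. With False (the default), a new DataFrame is returned.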
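To make the keep options concrete, here is a small sketch on a made-up DataFrame. The names and values are illustrative assumptions, not data from this lesson. It prints the row labels that survive under each setting.

```python
import pandas as pd

# Illustrative data: rows 0 and 2 are identical, rows 1 and 3 are unique.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Carol'],
    'Age': [25, 30, 25, 35],
})

first_kept = df.drop_duplicates(keep='first')  # keeps row 0, drops row 2
last_kept = df.drop_duplicates(keep='last')    # keeps row 2, drops row 0
none_kept = df.drop_duplicates(keep=False)     # drops both row 0 and row 2

print(first_kept.index.tolist())  # [0, 1, 3]
print(last_kept.index.tolist())   # [1, 2, 3]
print(none_kept.index.tolist())   # [1, 3]
```

keep=False is the option to reach for when any repeated row is suspect and you would rather discard all copies than pick one.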

Examples

Remove duplicate rows considering all columns, keep the first occurrence.

df.drop_duplicates()

Remove duplicates based only on 'Name' and 'Age' columns.

df.drop_duplicates(subset=['Name', 'Age'])
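As a rough illustration of how subset changes the result, the sketch below uses a hypothetical table in which two rows share the same Name and Age but differ in City. Checking all columns treats them as different rows; checking only Name and Age treats the second one as a duplicate.

```python
import pandas as pd

# Hypothetical data: rows 0 and 1 match on Name and Age but not on City.
people = pd.DataFrame({
    'Name': ['Alice', 'Alice', 'Bob'],
    'Age': [25, 25, 30],
    'City': ['NY', 'Boston', 'LA'],
})

# All columns are compared, so no row counts as a duplicate.
all_columns = people.drop_duplicates()

# Only Name and Age are compared, so row 1 is dropped and row 0 is kept.
name_age_only = people.drop_duplicates(subset=['Name', 'Age'])

print(len(all_columns))                # 3
print(name_age_only['City'].tolist())  # ['NY', 'LA']
```

Note that the columns outside subset keep whatever values the surviving row had, so the 'Boston' entry is lost here.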

Keep the last occurrence of each duplicate row.

df.drop_duplicates(keep='last')

Remove duplicates and change the original DataFrame directly.

df.drop_duplicates(inplace=True)
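One detail worth seeing in action: with inplace=True the method returns None, so assigning its result to a variable is a common mistake. A minimal sketch, using an invented orders table:

```python
import pandas as pd

# Invented data: rows 0 and 2 are the same order.
orders = pd.DataFrame({
    'Customer': ['Ann', 'Ben', 'Ann'],
    'Item': ['Pen', 'Cup', 'Pen'],
})

# The DataFrame is changed directly; the return value is None.
result = orders.drop_duplicates(inplace=True)

print(result)       # None
print(len(orders))  # 2
```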

Sample Program

This code creates a small table with repeated rows. Then it removes duplicates and shows the cleaned table.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}

df = pd.DataFrame(data)

print('Original DataFrame:')
print(df)

# Remove duplicate rows
unique_df = df.drop_duplicates()

print('\nDataFrame after removing duplicates:')
print(unique_df)

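Output

The first print shows all five rows. After drop_duplicates(), only rows 0, 1, and 3 remain, because rows 2 and 4 are exact copies of rows 0 and 1. The surviving rows keep their original index labels.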

Important Notes

By default, drop_duplicates() returns a new DataFrame and does not change the original.

Use inplace=True if you want to modify the original DataFrame directly. The call then returns None, so do not assign its result to a variable.

Checking duplicates on specific columns helps when only some columns matter for uniqueness.
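The default, non-modifying behavior can be checked with a short sketch like the one below. The scores table is made up for illustration. It also shows that the surviving rows keep their old index labels; reset_index(drop=True) is one way to renumber them if that matters for your analysis.

```python
import pandas as pd

# Made-up data: rows 0 and 1 are identical.
scores = pd.DataFrame({
    'Student': ['Ana', 'Ana', 'Raj'],
    'Score': [90, 90, 85],
})

# Default call: returns a new DataFrame and leaves scores untouched.
deduped = scores.drop_duplicates()

print(len(scores))             # 3, the original still has every row
print(deduped.index.tolist())  # [0, 2], old labels are kept

# Optional: renumber the rows from 0.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1]
```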

Summary

Use drop_duplicates() to remove repeated rows from data.

You can choose which columns to check and which duplicates to keep.

Removing duplicates helps make data clean and ready for analysis.