0
0
Data Analysis Pythondata~5 mins

Removing duplicates (drop_duplicates) in Data Analysis Python

Choose your learning style9 modes available
Introduction

Sometimes data has repeated rows that we don't need. Removing duplicates helps us clean data and get accurate results.

You have a list of customer orders and want to count each customer only once.
You collected survey responses but some people submitted multiple times.
You merged two datasets and want to remove repeated rows.
You want to prepare data for analysis without repeated entries.
Syntax
Data Analysis Python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

subset lets you choose columns to check for duplicates. If None, all columns are checked.

keep decides which duplicate to keep: 'first' keeps the first, 'last' keeps the last, and False drops all duplicates.

Examples
Remove duplicate rows considering all columns, keep the first occurrence.
Data Analysis Python
df.drop_duplicates()
Remove duplicates based only on 'Name' and 'Age' columns.
Data Analysis Python
df.drop_duplicates(subset=['Name', 'Age'])
Keep the last occurrence of each duplicate row.
Data Analysis Python
df.drop_duplicates(keep='last')
Remove duplicates and change the original DataFrame directly.
Data Analysis Python
df.drop_duplicates(inplace=True)
Sample Program

This code creates a small table with repeated rows. Then it removes duplicates and shows the cleaned table.

Data Analysis Python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}

df = pd.DataFrame(data)

print('Original DataFrame:')
print(df)

# Remove duplicate rows
unique_df = df.drop_duplicates()

print('\nDataFrame after removing duplicates:')
print(unique_df)
OutputSuccess
Important Notes

By default, drop_duplicates() returns a new DataFrame and does not change the original.

Use inplace=True if you want to modify the original DataFrame directly.

Checking duplicates on specific columns helps when only some columns matter for uniqueness.

Summary

Use drop_duplicates() to remove repeated rows from data.

You can choose which columns to check and which duplicates to keep.

Removing duplicates helps make data clean and ready for analysis.