
duplicated() for finding duplicates in Pandas

Introduction

We use duplicated() to find repeated rows in a DataFrame. It flags each row that is a copy of another row, so repeated information is easy to spot.

When cleaning a list of customer records to remove repeated entries.
When checking survey responses to find if someone answered twice.
When analyzing sales data to find duplicate transactions.
When preparing data for analysis and you want to avoid counting duplicates.
When merging data and you want to check whether duplicates appeared after combining.
Syntax
Pandas
DataFrame.duplicated(subset=None, keep='first')

subset lets you choose columns to check for duplicates. If None, all columns are checked.

keep decides which occurrence in each group of duplicates is left as False: 'first' leaves the first occurrence unmarked, 'last' leaves the last occurrence unmarked, and False marks every occurrence as True.

Examples
Find duplicates across all columns. Every occurrence except the first is marked True.
Pandas
df.duplicated()
Find duplicates only based on 'Name' and 'Age' columns.
Pandas
df.duplicated(subset=['Name', 'Age'])
Mark duplicates as True except the last occurrence.
Pandas
df.duplicated(keep='last')
Mark all duplicates as True, including the first occurrence.
Pandas
df.duplicated(keep=False)
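To make the four calls above concrete, here is a minimal sketch that runs them on one small table. The data and the variable names are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical,
# and row 3 matches them on Name and Age only.
df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Anna'],
    'Age': [25, 30, 25, 25],
    'City': ['NY', 'LA', 'NY', 'Boston'],
})

first = df.duplicated()                              # later full-row copies are True
by_name_age = df.duplicated(subset=['Name', 'Age'])  # compare only Name and Age
last = df.duplicated(keep='last')                    # earlier full-row copies are True
every = df.duplicated(keep=False)                    # every row in a duplicate group is True

# Put the four masks side by side to compare them row by row.
print(pd.DataFrame({'first': first, 'name_age': by_name_age,
                    'last': last, 'all': every}))
```

Row 3 is flagged only in the subset check, because its City differs from the other Anna rows.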
Sample Program

This code creates a small table of people with their age and city. It then finds duplicates in two ways: first by all columns, second by just name and age, marking all duplicates.

Pandas
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Anna', 'Mike', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Find duplicates considering all columns
duplicates_all = df.duplicated()

# Find duplicates based on Name and Age only
duplicates_name_age = df.duplicated(subset=['Name', 'Age'], keep=False)

print('Duplicates (all columns):')
print(duplicates_all)
print('\nDuplicates (Name and Age, all duplicates marked):')
print(duplicates_name_age)
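Output
Duplicates (all columns):
0    False
1    False
2     True
3    False
4     True
dtype: bool

Duplicates (Name and Age, all duplicates marked):
0     True
1     True
2     True
3    False
4     True
dtype: bool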
Important Notes

Use duplicated() before removing duplicates to understand your data better.
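One simple way to understand the data before removing anything is to count the flagged rows. The sketch below assumes a small made-up table. It relies on the fact that True counts as 1 when a boolean Series is summed.

```python
import pandas as pd

# Hypothetical customer records; the second 'Anna' row repeats the first.
df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Mike'],
    'Age': [25, 30, 25, 40],
})

mask = df.duplicated()

has_duplicates = bool(mask.any())  # True if at least one row repeats an earlier one
n_duplicates = int(mask.sum())     # number of rows flagged as duplicates
share = float(mask.mean())         # fraction of rows flagged as duplicates

print(has_duplicates, n_duplicates, share)
```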

Remember that duplicated() returns a boolean Series, not the duplicate rows themselves.
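To see the duplicate rows themselves, the boolean Series can be used as a filter on the DataFrame. This is a minimal sketch on made-up data. Using keep=False here is a choice for illustration, so that every copy shows up in the result.

```python
import pandas as pd

# Hypothetical data with two repeated people.
df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Mike', 'Bob'],
    'Age': [25, 30, 25, 40, 30],
})

mask = df.duplicated(keep=False)  # boolean Series, True for every repeated row
duplicate_rows = df[mask]         # boolean indexing keeps rows where mask is True

# Sorting puts the copies next to each other for easier reading.
print(duplicate_rows.sort_values(['Name', 'Age']))
```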

Combine with drop_duplicates() to remove duplicates after finding them.
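The find-then-remove workflow can be sketched as follows, again on made-up data. With default settings, drop_duplicates() keeps the first occurrence of each row, which gives the same rows as filtering with the negated duplicated() mask.

```python
import pandas as pd

# Hypothetical data: rows 2 and 4 repeat rows 0 and 1.
df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Mike', 'Bob'],
    'City': ['NY', 'LA', 'NY', 'Chicago', 'LA'],
})

# Step 1: find out how many rows are duplicates.
print(df.duplicated().sum(), 'duplicate rows found')

# Step 2: remove them.
deduped = df.drop_duplicates()   # keeps the first occurrence by default
via_mask = df[~df.duplicated()]  # same rows, selected by negating the mask

print(deduped)
```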

Summary

duplicated() helps find repeated rows in data.

You can check duplicates by all columns or specific columns.

It returns a boolean Series: True for rows flagged as duplicates and False for the rest. The keep setting controls which occurrence in each group of duplicates stays False.