Pandas · How-To · Beginner · 3 min read

How to Find Duplicate Rows in a pandas DataFrame

Use the duplicated() method in pandas to find duplicate rows in a DataFrame. It returns a boolean Series indicating which rows are duplicates. You can filter the DataFrame using this Series to see or remove duplicates.

📐

Syntax

The duplicated() method checks for duplicate rows in a DataFrame and returns a boolean Series. Key parameters include:

subset: Specify columns to check for duplicates instead of all columns.
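keep: Controls which occurrences are marked as True. 'first' (default) marks every repeat except the first occurrence, 'last' marks every repeat except the last occurrence, and False marks all rows that have a duplicate.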

python

DataFrame.duplicated(subset=None, keep='first')
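
To make the three keep options concrete, here is a minimal sketch on a small, made-up DataFrame. The column names and values are illustrative only and are not part of the example data used later.

```python
import pandas as pd

# Hypothetical data: rows 0, 2 and 3 are identical
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Alice'],
                   'Age': [25, 30, 25, 25]})

# keep='first' (default): every repeat after the first occurrence is True
first_mask = df.duplicated(keep='first')

# keep='last': every repeat before the last occurrence is True
last_mask = df.duplicated(keep='last')

# keep=False: every row that has an identical twin is True
all_mask = df.duplicated(keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

The unique 'Bob' row is False in all three cases; only the treatment of the repeated 'Alice' rows changes.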

💻

Example

This example finds the duplicate rows in a pandas DataFrame and then uses the resulting boolean Series as a mask to display only those rows.

python

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Find duplicate rows
duplicates = df.duplicated()

# Show boolean Series indicating duplicates
print(duplicates)

# Filter and show only duplicate rows
print(df[duplicates])

Output

0    False
1    False
2     True
3    False
4     True
dtype: bool
    Name  Age City
2  Alice   25   NY
4    Bob   30   LA
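
The introduction mentions removing duplicates as well as viewing them. As a minimal sketch of that step, the code below inverts the mask with ~ so that only the first occurrence of each row is kept. It reuses the example data, and drop_duplicates() is shown as the pandas shortcut for the same operation.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 25, 40, 30],
                   'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']})

# Invert the mask: True now means "not a repeat of an earlier row"
unique_rows = df[~df.duplicated()]

# drop_duplicates() gives the same result in one call
deduped = df.drop_duplicates()

print(unique_rows.index.tolist())   # [0, 1, 3]
print(unique_rows.equals(deduped))  # True
```

Both approaches keep the original index labels, so the result still refers to rows 0, 1 and 3 of the source DataFrame.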

⚠️

Common Pitfalls

Common mistakes when finding duplicates include:

Not specifying subset when you want to check duplicates only on certain columns.
Misunderstanding the keep parameter, which controls which duplicates are marked as True.
Forgetting that, by default, duplicated() leaves the first occurrence as False and marks only the later repeats as True, so df[df.duplicated()] does not show every copy of a repeated row.

python

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Wrong: Checking duplicates without subset when only 'Name' matters
print(df.duplicated())

# Right: Check duplicates only on 'Name'
print(df.duplicated(subset=['Name']))

# Using keep=False to mark all duplicates
print(df.duplicated(keep=False))

Output

0    False
1    False
2     True
3    False
4     True
dtype: bool
0    False
1    False
2     True
3    False
4     True
dtype: bool
0     True
1     True
2     True
3    False
4     True
dtype: bool
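
In the sample data, every repeated Name also repeats Age and City, so the first two outputs are identical. The difference only shows up once a name recurs with other values. The sketch below uses a made-up variant of the data (hypothetical values, not from the example above) to illustrate this.

```python
import pandas as pd

# Hypothetical variant: the second 'Alice' has a different Age and City
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 31],
                   'City': ['NY', 'LA', 'Boston']})

# All columns: no row is an exact copy of another
full_mask = df.duplicated()

# Only 'Name': the second 'Alice' counts as a duplicate
name_mask = df.duplicated(subset=['Name'])

print(full_mask.tolist())  # [False, False, False]
print(name_mask.tolist())  # [False, False, True]
```

Which call is right depends on what counts as "the same record" in your data: the whole row, or just a key column such as Name.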

📊

Quick Reference

Summary of duplicated() parameters:

Parameter	Description	Default
subset	Columns to consider for identifying duplicates	None (all columns)
keep	Which duplicates to mark as True: 'first', 'last', or False (all duplicates)	'first'
inplace	Not a parameter of duplicated(); it belongs to drop_duplicates(), where it controls whether the DataFrame is modified in place	False (in drop_duplicates())

✅

Key Takeaways

Use pandas.DataFrame.duplicated() to find duplicate rows as a boolean Series.

Specify subset to check duplicates on specific columns only.

The keep parameter controls which occurrences are marked True; with the default 'first', the first occurrence stays False and only later repeats are marked.

Filter the DataFrame with the boolean Series to view or remove duplicates.

Use keep=False to mark all duplicates as True if you want to see every repeated row.
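
As a minimal sketch of that last point, the code below selects every repeated row with keep=False and sorts the result so identical rows sit next to each other. Sorting on all three columns is just one convenient choice for this example data, not a requirement.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 25, 40, 30],
                   'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']})

# Select every row that occurs more than once
repeated = df[df.duplicated(keep=False)]

# Sort so that identical rows end up next to each other
repeated_sorted = repeated.sort_values(by=['Name', 'Age', 'City'])

print(repeated_sorted['Name'].tolist())  # ['Alice', 'Alice', 'Bob', 'Bob']

# With the default keep='first', summing the mask counts the extra copies
extra_copies = int(df.duplicated().sum())
print(extra_copies)  # 2
```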