
How to Find Duplicate Rows in pandas DataFrame

Use the duplicated() method in pandas to find duplicate rows in a DataFrame. It returns a boolean Series in which True marks a row that repeats an earlier row. You can filter the DataFrame with this Series to view or remove the duplicates.

Syntax

The duplicated() method checks for duplicate rows in a DataFrame and returns a boolean Series. Key parameters include:

  • subset: A column label or list of labels to compare. By default, all columns are compared.
  • keep: Controls which occurrences are marked as True. 'first' (default) marks every occurrence except the first, 'last' marks every occurrence except the last, and False marks all occurrences.
python
DataFrame.duplicated(subset=None, keep='first')
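To make the keep options concrete, here is a minimal sketch on a small DataFrame. The column name key and its values are hypothetical, chosen only so that one value appears three times.

```python
import pandas as pd

# Hypothetical sample data: the value 'a' appears three times.
df = pd.DataFrame({'key': ['a', 'b', 'a', 'a']})

first = df.duplicated(keep='first')  # flags every repeat except the first occurrence
last = df.duplicated(keep='last')    # flags every repeat except the last occurrence
every = df.duplicated(keep=False)    # flags every row that has a duplicate

print(first.tolist())  # [False, False, True, True]
print(last.tolist())   # [True, False, True, False]
print(every.tolist())  # [True, False, True, True]
```

The row containing 'b' is never flagged because it has no duplicate under any setting.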

Example

This example shows how to find duplicate rows in a pandas DataFrame and select them with the resulting boolean Series.

python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Find duplicate rows
duplicates = df.duplicated()

# Show boolean Series indicating duplicates
print(duplicates)

# Filter and show only duplicate rows
print(df[duplicates])
Output
0    False
1    False
2     True
3    False
4     True
dtype: bool
    Name  Age City
2  Alice   25   NY
4    Bob   30   LA
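The same boolean Series can also be inverted to remove the duplicates rather than display them. The sketch below reuses the example data; the variable names mask, unique_rows, and same_result are illustrative, and drop_duplicates() is shown only as the built-in shortcut for the same operation.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# True for every row that repeats an earlier row
mask = df.duplicated()

# ~ inverts the mask, keeping only the first occurrence of each row
unique_rows = df[~mask]
print(unique_rows)

# drop_duplicates() with default arguments keeps the same rows
same_result = df.drop_duplicates()
print(unique_rows.equals(same_result))  # True
```

Both approaches keep the original index labels (0, 1, and 3 here); call reset_index(drop=True) afterwards if a fresh index is needed.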

Common Pitfalls

Common mistakes when finding duplicates include:

  • Omitting subset when only certain columns should decide whether two rows count as duplicates.
  • Misreading the keep parameter, which controls which occurrences are marked as True.
  • Forgetting that, by default, duplicated() leaves the first occurrence marked False and flags only the later repeats.
python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Wrong: Checking duplicates without subset when only 'Name' matters
print(df.duplicated())

# Right: Check duplicates only on 'Name'
print(df.duplicated(subset=['Name']))

# Using keep=False to mark all duplicates
print(df.duplicated(keep=False))
Output
0    False
1    False
2     True
3    False
4     True
dtype: bool
0    False
1    False
2     True
3    False
4     True
dtype: bool
0     True
1     True
2     True
3    False
4     True
dtype: bool
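In the sample data above, every repeated name also repeats the same age and city, so the first two calls print identical results. The following sketch uses a small hypothetical dataset in which a name repeats with a different city, which is where subset changes the answer.

```python
import pandas as pd

# Hypothetical data: 'Alice' appears twice, but in different cities.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'City': ['NY', 'LA', 'Boston']})

# Comparing all columns: no row is an exact copy of another
all_columns = df.duplicated()
print(all_columns.tolist())  # [False, False, False]

# Comparing only 'Name': the second 'Alice' is flagged
name_only = df.duplicated(subset=['Name'])
print(name_only.tolist())  # [False, False, True]
```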

Quick Reference

Summary of duplicated() parameters:

Parameter | Description | Default
subset | Columns to consider when identifying duplicates | None (all columns)
keep | Which occurrences to mark as True: 'first', 'last', or False (all occurrences) | 'first'
inplace | Not a parameter of duplicated(); it belongs to drop_duplicates(), where it controls whether the DataFrame is modified in place | False (in drop_duplicates())
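Because inplace is easy to confuse between the two methods, here is a short sketch of the difference: duplicated() only reports and never changes the DataFrame, while drop_duplicates(inplace=True) modifies it and returns None. The data and variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# duplicated() returns a new boolean Series; df itself is unchanged
mask = df.duplicated()
rows_before = len(df)
print(rows_before)  # 3

# drop_duplicates(inplace=True) removes the repeats from df and returns None
result = df.drop_duplicates(inplace=True)
rows_after = len(df)
print(result)      # None
print(rows_after)  # 2
```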

Key Takeaways

  • Use pandas.DataFrame.duplicated() to find duplicate rows as a boolean Series.
  • Specify subset to check for duplicates on specific columns only.
  • The keep parameter controls which occurrences are marked True; the default 'first' leaves the first occurrence unmarked and flags the later ones.
  • Filter the DataFrame with the boolean Series to view duplicates, or with its inverse to remove them.
  • Use keep=False to mark all occurrences as True if you want to see every repeated row.