How to Find Duplicate Rows in a pandas DataFrame
Use the duplicated() method in pandas to find duplicate rows in a DataFrame. It returns a boolean Series indicating which rows are duplicates. You can filter the DataFrame using this Series to see or remove duplicates.
Syntax
The duplicated() method checks for duplicate rows in a DataFrame and returns a boolean Series. Key parameters include:
- subset: Specify columns to check for duplicates instead of all columns.
- keep: Controls which duplicates to mark as True. Options are 'first' (default), 'last', or False (mark all duplicates).
python
DataFrame.duplicated(subset=None, keep='first')
Example
This example shows how to find duplicate rows in a pandas DataFrame and display only those rows.
python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Find duplicate rows
duplicates = df.duplicated()

# Show boolean Series indicating duplicates
print(duplicates)

# Filter and show only duplicate rows
print(df[duplicates])
Output
0    False
1    False
2     True
3    False
4     True
dtype: bool
    Name  Age City
2  Alice   25   NY
4    Bob   30   LA
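The same boolean Series can be inverted to remove the flagged rows instead of displaying them. The following is a minimal sketch that reuses the example data; the variable names mask, unique_rows, and deduped are illustrative only, and drop_duplicates() is shown as the built-in shortcut that should give the same rows.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# True for every row that repeats an earlier row
mask = df.duplicated()

# Invert the mask with ~ to keep only the first occurrence of each row
unique_rows = df[~mask]

# drop_duplicates() returns the same rows in a single call
deduped = df.drop_duplicates()

print(unique_rows)
print(unique_rows.equals(deduped))
```

Both approaches keep the original index labels, so call reset_index(drop=True) afterwards if you need a fresh 0-based index.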
Common Pitfalls
Common mistakes when finding duplicates include:
- Not specifying subset when you want to check duplicates only on certain columns.
- Misunderstanding the keep parameter, which controls which duplicates are marked as True.
- Forgetting that duplicated() marks duplicates as True except the first occurrence by default.
python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Wrong: Checking duplicates without subset when only 'Name' matters
print(df.duplicated())

# Right: Check duplicates only on 'Name'
print(df.duplicated(subset=['Name']))

# Using keep=False to mark all duplicates
print(df.duplicated(keep=False))
Output
0    False
1    False
2     True
3    False
4     True
dtype: bool
0    False
1    False
2     True
3    False
4     True
dtype: bool
0     True
1     True
2     True
3    False
4     True
dtype: bool
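In the data above, rows that share a name also share every other column, so the first two outputs are identical. The sketch below uses a small hypothetical DataFrame (not from the example above) in which one repeated name has a different age and city, so the effect of subset and of keep='last' becomes visible. The variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: the second 'Alice' has a different age and city
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
                   'Age': [25, 30, 31, 30],
                   'City': ['NY', 'LA', 'Boston', 'LA']})

# Compares all columns: only the second 'Bob' row is an exact repeat
full_row = df.duplicated()

# Compares only 'Name': both repeated names are flagged
by_name = df.duplicated(subset=['Name'])

# keep='last' flags the earlier occurrences and leaves the last one unmarked
last_kept = df.duplicated(subset=['Name'], keep='last')

print(full_row.tolist())   # [False, False, False, True]
print(by_name.tolist())    # [False, False, True, True]
print(last_kept.tolist())  # [True, True, False, False]
```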
Quick Reference
Summary of duplicated() parameters:
| Parameter | Description | Default |
|---|---|---|
| subset | Columns to consider for identifying duplicates | None (all columns) |
| keep | Which occurrence to leave unmarked: 'first', 'last', or False (mark every duplicate as True) | 'first' |
| inplace | Not accepted by duplicated(); it is a parameter of drop_duplicates(), where it modifies the DataFrame in place | N/A |
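The inplace entry is easy to misread, so here is a short sketch of how it behaves in practice, assuming a hypothetical three-row DataFrame: duplicated() takes only subset and keep, while inplace belongs to drop_duplicates().

```python
import pandas as pd

# Hypothetical three-row frame with one repeated row
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# duplicated() only accepts subset and keep, so inplace raises TypeError
try:
    df.duplicated(inplace=True)
    rejected = False
except TypeError:
    rejected = True
print(rejected)

# drop_duplicates() does accept inplace and modifies df directly
df.drop_duplicates(inplace=True)
print(len(df))
```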
Key Takeaways
Use pandas.DataFrame.duplicated() to find duplicate rows as a boolean Series.
Specify subset to check duplicates on specific columns only.
The keep parameter controls which duplicates are marked True; 'first' leaves the first occurrence unmarked and flags the later ones.
Filter the DataFrame with the boolean Series to view or remove duplicates.
Use keep=False to mark all duplicates as True if you want to see every repeated row.
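As one possible way to apply the last point, the sketch below selects every repeated row with keep=False and sorts by 'Name' so that matching rows sit next to each other. It reuses the example data; the variable name repeated and the choice of sort column are assumptions for illustration.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# keep=False flags every member of a duplicate group, including the first
repeated = df[df.duplicated(keep=False)]

# Sort so that rows belonging to the same group appear together
repeated = repeated.sort_values('Name', kind='stable')

print(repeated)
```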