How to Remove Duplicate Rows in a pandas DataFrame
Use the drop_duplicates() method on a pandas DataFrame to remove duplicate rows. You can specify which columns to check for duplicates and choose to keep the first, last, or no duplicates by setting the subset and keep parameters.
Syntax
The drop_duplicates() method removes duplicate rows from a DataFrame.
- subset: Columns to consider when identifying duplicates. Default is all columns.
- keep: Which duplicates to keep: 'first' (default), 'last', or False to drop all duplicates.
- inplace: If True, modifies the original DataFrame; otherwise returns a new one.
python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
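To make the effect of keep concrete, here is a minimal sketch on a made-up three-row DataFrame (the column names and values are illustrative only). It compares the three keep settings by checking which index labels survive.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 are identical.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# keep='first' (default): the first occurrence in each duplicate group survives.
kept_first = df.drop_duplicates(keep='first').index.tolist()

# keep='last': the last occurrence survives instead.
kept_last = df.drop_duplicates(keep='last').index.tolist()

# keep=False: every row that has a duplicate is dropped.
kept_none = df.drop_duplicates(keep=False).index.tolist()

print(kept_first)  # [0, 1]
print(kept_last)   # [1, 2]
print(kept_none)   # [1]
```

Note that keep=False removes both copies of the repeated row, so it is the right choice only when any repeated record should be discarded entirely.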
Example
This example shows how to remove duplicate rows from a DataFrame. It demonstrates removing duplicates based on all columns and based on a specific column.
python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Remove duplicates considering all columns, keep first occurrence
unique_all = df.drop_duplicates()

# Remove duplicates based on 'Name' column, keep last occurrence
unique_name_last = df.drop_duplicates(subset=['Name'], keep='last')

print('Original DataFrame:')
print(df)
print('\nAfter removing duplicates (all columns):')
print(unique_all)
print('\nAfter removing duplicates based on Name (keep last):')
print(unique_name_last)
Output
Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
4    Bob   30       LA

After removing duplicates (all columns):
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago

After removing duplicates based on Name (keep last):
    Name  Age     City
2  Alice   25       NY
3  David   40  Chicago
4    Bob   30       LA
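The results above keep their original index labels, so the index has gaps after rows are dropped. If a clean 0-based index is wanted, one option is to chain reset_index(drop=True). The sketch below uses a small made-up DataFrame; in pandas 1.0 and later, passing ignore_index=True to drop_duplicates() should give the same renumbering in one step.

```python
import pandas as pd

# Hypothetical sample data: row 2 repeats row 0.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David'],
                   'Age': [25, 30, 25, 40]})

# The surviving rows keep their original labels, leaving a gap at 2.
unique_rows = df.drop_duplicates()
print(unique_rows.index.tolist())  # [0, 1, 3]

# drop=True discards the old labels instead of adding them as a column.
renumbered = unique_rows.reset_index(drop=True)
print(renumbered.index.tolist())   # [0, 1, 2]
```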
Common Pitfalls
Common mistakes when removing duplicates include:
- Not specifying subset when you want to check duplicates only on certain columns.
- Forgetting that drop_duplicates() returns a new DataFrame unless inplace=True is set.
- Misunderstanding the keep parameter, which controls which duplicates remain.
python
import pandas as pd

data = {'A': [1, 1, 2], 'B': [3, 3, 4]}
df = pd.DataFrame(data)

# Wrong: drop_duplicates called but result not saved or inplace not set
# This does NOT remove duplicates from df
df.drop_duplicates()
print('DataFrame after drop_duplicates without assignment:')
print(df)

# Right: assign back or use inplace=True
df_clean = df.drop_duplicates()
print('\nDataFrame after drop_duplicates with assignment:')
print(df_clean)
Output
DataFrame after drop_duplicates without assignment:
   A  B
0  1  3
1  1  3
2  2  4

DataFrame after drop_duplicates with assignment:
   A  B
0  1  3
2  2  4
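A related trap is combining inplace=True with assignment. With inplace=True the method modifies the DataFrame and returns None, so assigning the result back would replace the DataFrame with None. A minimal sketch, reusing the same sample data (the variable name result is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2], 'B': [3, 3, 4]})

# With inplace=True the method returns None and changes df itself.
result = df.drop_duplicates(inplace=True)

print(result)   # None
print(len(df))  # 2

# So avoid writing `df = df.drop_duplicates(inplace=True)`:
# that would rebind df to None and lose the data.
```

Pick one style per call: either assign the returned DataFrame, or use inplace=True and ignore the return value.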
Quick Reference
Summary of drop_duplicates() parameters:
| Parameter | Description | Default |
|---|---|---|
| subset | Columns to consider for identifying duplicates | None (all columns) |
| keep | Which duplicates to keep: 'first', 'last', or False (drop all duplicates) | 'first' |
| inplace | Modify original DataFrame if True | False |
Key Takeaways
Use df.drop_duplicates() to remove duplicate rows from a pandas DataFrame.
Specify subset to check duplicates on specific columns only.
Remember drop_duplicates returns a new DataFrame unless inplace=True is set.
Use keep='first', keep='last', or keep=False to control which duplicate rows remain.
Check your DataFrame after removing duplicates to confirm changes.
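One way to do that check is with the duplicated() method, which flags the rows that drop_duplicates() would remove under the same subset and keep settings. The sketch below uses made-up sample data and illustrative variable names; it counts the flagged rows first and then confirms none remain.

```python
import pandas as pd

# Hypothetical sample data: rows 2 and 4 repeat rows 0 and 1.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 25, 40, 30]})

# duplicated() returns a boolean Series; True marks a repeated row.
dupe_mask = df.duplicated()
n_dupes = int(dupe_mask.sum())

df_clean = df.drop_duplicates()

# Sanity checks: the row count drops by the number of flagged rows,
# and nothing is flagged as a duplicate afterwards.
rows_removed = len(df) - len(df_clean)
still_has_dupes = bool(df_clean.duplicated().any())

print(n_dupes, rows_removed, still_has_dupes)  # 2 2 False
```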