How to Remove Duplicates in pandas DataFrames Easily
Use the `drop_duplicates()` method on a pandas DataFrame to remove duplicate rows. You can specify which columns to check for duplicates and choose to keep the first, the last, or no duplicates by setting the `subset` and `keep` parameters.
Syntax
The basic syntax to remove duplicates in pandas is:

```python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
```

- `subset`: Column label or list of labels to consider when identifying duplicates. If None, all columns are used.
- `keep`: Which duplicates to keep. Options are 'first' (default), 'last', or False (drop all duplicates).
- `inplace`: If True, modifies the original DataFrame in place and returns None. Otherwise, returns a new DataFrame.
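To make the `keep` options concrete, here is a minimal sketch on a one-column frame (the column name `x` is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2]})

# keep='first' (default): keeps row 0, drops the later copy at row 1
print(df.drop_duplicates(keep='first')['x'].tolist())   # [1, 2]

# keep='last': keeps row 1, drops the earlier copy at row 0
print(df.drop_duplicates(keep='last')['x'].tolist())    # [1, 2]

# keep=False: drops BOTH copies of 1, keeping only values that were never duplicated
print(df.drop_duplicates(keep=False)['x'].tolist())     # [2]
```

With `keep='first'` and `keep='last'` the values are the same here; what differs is which original row (and index label) survives.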
Example
This example shows how to remove duplicate rows from a DataFrame. It keeps the first occurrence of each duplicate.
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)
print('Original DataFrame:')
print(df)

# Remove duplicates, keeping the first occurrence
clean_df = df.drop_duplicates()
print('\nDataFrame after removing duplicates:')
print(clean_df)
```
Output
```
Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
4    Bob   30       LA

DataFrame after removing duplicates:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
```
Common Pitfalls
Common mistakes when removing duplicates include:
- Not specifying `subset` when you want to check duplicates only on certain columns.
- Forgetting to set `inplace=True` (or to reassign the result) when you want to modify the original DataFrame.
- Using `keep=False` without realizing it drops every row that has a duplicate, keeping only rows that were already unique.
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 28, 40, 30],
        'City': ['NY', 'LA', 'Boston', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Wrong: checks all columns, so the second Alice row (different Age and City)
# is not treated as a duplicate
wrong = df.drop_duplicates()

# Right: specify subset to check duplicates only on 'Name'
right = df.drop_duplicates(subset=['Name'])

print('Wrong approach (checks all columns):')
print(wrong)
print('\nRight approach (checks only Name column):')
print(right)
```
Output
```
Wrong approach (checks all columns):
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   28   Boston
3  David   40  Chicago

Right approach (checks only Name column):
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
```
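One more detail worth knowing: the rows that survive keep their original index labels (note the gap at index 2 above). If you want a clean 0..n-1 index, a quick sketch using the `ignore_index` parameter (available in pandas 1.0+) or `reset_index`:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Alice', 'Bob'],
                   'Age': [25, 25, 30]})

# Default: surviving rows keep their old index labels
print(df.drop_duplicates().index.tolist())                          # [0, 2]

# ignore_index=True renumbers the result 0..n-1
print(df.drop_duplicates(ignore_index=True).index.tolist())         # [0, 1]

# Equivalent on older pandas versions: reset the index afterwards
print(df.drop_duplicates().reset_index(drop=True).index.tolist())   # [0, 1]
```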
Quick Reference
Here is a quick summary of drop_duplicates() parameters:
| Parameter | Description | Default |
|---|---|---|
| subset | Columns to consider for duplicates | None (all columns) |
| keep | Which duplicates to keep: 'first', 'last', or False (drop all) | 'first' |
| inplace | Modify original DataFrame if True | False |
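A small sketch of the `inplace` distinction from the table above: with `inplace=True` the call modifies the DataFrame itself and returns None, so do not assign its result back to a variable you plan to use:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2]})

# Default (inplace=False): returns a new DataFrame; the original is untouched
out = df.drop_duplicates()
print(len(df), len(out))      # 3 2

# inplace=True: modifies df itself and returns None
result = df.drop_duplicates(inplace=True)
print(result, len(df))        # None 2
```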
Key Takeaways
- Use `DataFrame.drop_duplicates()` to remove duplicate rows easily.
- Specify `subset` to check duplicates on specific columns only.
- Set `keep='first'` or `keep='last'` to control which occurrence survives.
- Use `inplace=True` to modify the original DataFrame directly.
- Remember that `keep=False` drops every row that has a duplicate, keeping only rows that were already unique.