How to Use drop_duplicates in pandas to Remove Duplicate Rows
Use drop_duplicates() in pandas to remove duplicate rows from a DataFrame. You can restrict the comparison to specific columns with the subset parameter, and use the keep parameter to retain the first occurrence, the last occurrence, or none of the duplicated rows.
Syntax
The basic syntax of drop_duplicates() is:
```python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
```
- subset: Column label or list of labels to consider when identifying duplicates. Default is all columns.
- keep: Which duplicates to keep: 'first' (default), 'last', or False to drop all duplicates.
- inplace: If True, modifies the DataFrame in place instead of returning a new one.
- ignore_index: If True, resets the index in the returned DataFrame so the rows are labeled 0, 1, ..., n - 1.
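The effect of the three keep values is easiest to see side by side. The following is a minimal sketch on a small made-up DataFrame (the id and value columns are illustrative assumptions, not part of the example later in this article); it compares which row labels survive for each setting.

```python
import pandas as pd

# Hypothetical sample data: rows 1-2 are identical, and rows 3-4 are identical
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 3, 3],
    "value": ["a", "b", "b", "c", "c", "d"],
})

keep_first = df.drop_duplicates(keep="first")  # keep the earliest row of each duplicate group
keep_last = df.drop_duplicates(keep="last")    # keep the latest row of each duplicate group
keep_none = df.drop_duplicates(keep=False)     # drop every row that has a duplicate

print(keep_first.index.tolist())  # [0, 1, 3, 5]
print(keep_last.index.tolist())   # [0, 2, 4, 5]
print(keep_none.index.tolist())   # [0, 5]
```

Note that keep=False does not mean "keep nothing": rows that were never duplicated (labels 0 and 5 here) still remain.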
Example
This example shows how to remove duplicate rows from a DataFrame using drop_duplicates(). It demonstrates keeping the first occurrence and dropping duplicates based on specific columns.
```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 25, 35, 40],
    'City': ['NY', 'LA', 'NY', 'LA', 'SF'],
}

# Create DataFrame
df = pd.DataFrame(data)

# Remove duplicate rows keeping the first occurrence
unique_df = df.drop_duplicates()

# Remove duplicates based on 'Name' column, keep last occurrence
unique_name_df = df.drop_duplicates(subset=['Name'], keep='last')

print('Original DataFrame:')
print(df)
print('\nDataFrame after drop_duplicates():')
print(unique_df)
print('\nDataFrame after drop_duplicates(subset=["Name"], keep="last"):')
print(unique_name_df)
```
Output
```
Original DataFrame:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2    Alice   25   NY
3      Bob   35   LA
4  Charlie   40   SF

DataFrame after drop_duplicates():
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
3      Bob   35   LA
4  Charlie   40   SF

DataFrame after drop_duplicates(subset=["Name"], keep="last"):
      Name  Age City
2    Alice   25   NY
3      Bob   35   LA
4  Charlie   40   SF
```
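The output above keeps the original row labels (2, 3, 4 in the last result), which leaves gaps in the index. As a small sketch reusing the same data, passing ignore_index=True relabels the surviving rows from 0; this assumes a pandas version that supports the parameter (it was added in pandas 1.0).

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 25, 35, 40],
    'City': ['NY', 'LA', 'NY', 'LA', 'SF'],
}
df = pd.DataFrame(data)

# Default: surviving rows keep their original labels
kept_labels = df.drop_duplicates(subset=['Name'], keep='last')

# ignore_index=True: surviving rows are relabeled 0, 1, 2, ...
renumbered = df.drop_duplicates(subset=['Name'], keep='last', ignore_index=True)

print(kept_labels.index.tolist())  # [2, 3, 4]
print(renumbered.index.tolist())   # [0, 1, 2]
```

The row contents are identical in both results; only the index differs.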
Common Pitfalls
Common mistakes when using drop_duplicates() include:
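- Not specifying subset when you want to check duplicates only on certain columns. Without it, rows count as duplicates only if every column matches, so rows that repeat in the columns you care about are left in place.
- Forgetting that drop_duplicates() returns a new DataFrame unless inplace=True is set.
- Misunderstanding the keep parameter, which controls which row of each duplicate group is kept; keep=False drops every row that has a duplicate.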
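The subset pitfall can be sketched as follows. The orders table and its column names are hypothetical, chosen only to show that comparing all columns and comparing a subset give different results.

```python
import pandas as pd

# Hypothetical data: the same customer appears twice with different order totals
orders = pd.DataFrame({
    "customer": ["ann", "ann", "ben"],
    "email": ["ann@example.com", "ann@example.com", "ben@example.com"],
    "order_total": [10.0, 25.0, 7.5],
})

# No subset: all three columns must match, so nothing is considered a duplicate
all_columns = orders.drop_duplicates()

# With subset: only customer and email are compared, so the second "ann" row is dropped
by_customer = orders.drop_duplicates(subset=["customer", "email"])

print(len(all_columns))  # 3
print(len(by_customer))  # 2
```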
Example of a common mistake and the correct way:
```python
import pandas as pd

data = {'A': [1, 2, 2, 3], 'B': [4, 5, 5, 6]}
df = pd.DataFrame(data)

# Wrong: expecting the original df to change. The deduplicated result
# is stored in a new variable, and df still contains the duplicate row.
wrong = df.drop_duplicates()
print('Original df after drop_duplicates() without inplace:')
print(df)

# Right: use inplace=True to modify the original df
df.drop_duplicates(inplace=True)
print('\nOriginal df after drop_duplicates(inplace=True):')
print(df)
```
Output
```
Original df after drop_duplicates() without inplace:
   A  B
0  1  4
1  2  5
2  2  5
3  3  6

Original df after drop_duplicates(inplace=True):
   A  B
0  1  4
1  2  5
3  3  6
```
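Setting inplace=True is not the only way to update the original variable. A common alternative, sketched here on the same data, is to assign the returned DataFrame back to the same name, which also makes it easy to chain further calls such as reset_index(). Which style to prefer is a matter of taste; this is one option, not the only correct one.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': [4, 5, 5, 6]})

# Rebind the name to the deduplicated result instead of using inplace=True
df = df.drop_duplicates()
print(df.index.tolist())  # [0, 1, 3]

# Because a DataFrame is returned, further calls can be chained
tidy = df.reset_index(drop=True)
print(tidy.index.tolist())  # [0, 1, 2]
```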
Quick Reference
Here is a quick summary of key parameters for drop_duplicates():
| Parameter | Description | Default |
|---|---|---|
| subset | Columns to consider for identifying duplicates | None (all columns) |
| keep | Which duplicates to keep: 'first', 'last', or False (drop all duplicates) | 'first' |
| inplace | Modify the DataFrame in place if True | False |
| ignore_index | Reset index in the returned DataFrame if True | False |
Key Takeaways
Use drop_duplicates() to remove duplicate rows from a pandas DataFrame easily.
Specify subset to check duplicates only on certain columns.
Remember drop_duplicates() returns a new DataFrame unless inplace=True is set.
Use the keep parameter to control which row of each duplicate group is kept.
Use ignore_index=True to reset the index after dropping duplicates.