
How to Use drop_duplicates in pandas to Remove Duplicate Rows

Use drop_duplicates() in pandas to remove duplicate rows from a DataFrame. You can specify which columns to compare with the subset parameter, and use the keep parameter to keep the first occurrence, the last occurrence, or none of the duplicated rows.

Syntax

The basic syntax of drop_duplicates() is:

python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

  • subset: Column label or list of labels to consider when identifying duplicates. The default, None, compares all columns.
  • keep: Which occurrence to keep: 'first' (default), 'last', or False to drop every row that has a duplicate.
  • inplace: If True, modifies the DataFrame in place and returns None instead of a new DataFrame.
  • ignore_index: If True, relabels the index of the result as 0, 1, ..., n - 1.
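
The keep=False and ignore_index=True options can be easier to follow with a small example. The following is a minimal sketch on made-up data; the column names id and value are hypothetical, and ignore_index assumes pandas 1.0 or later.

```python
import pandas as pd

# Hypothetical sample data: the row (2, 'b') appears twice
df = pd.DataFrame({'id': [1, 2, 2, 3], 'value': ['a', 'b', 'b', 'c']})

# keep=False drops every row that has a duplicate,
# leaving only rows that were unique to begin with
only_unique = df.drop_duplicates(keep=False)

# ignore_index=True relabels the surviving rows 0, 1, 2, ...
renumbered = df.drop_duplicates(ignore_index=True)

print(only_unique)
print(renumbered)
```

Here only_unique keeps the rows with id 1 and 3 under their original labels, while renumbered keeps one copy of each row and gets a fresh 0-based index.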

Example

This example shows how to remove duplicate rows from a DataFrame using drop_duplicates(). It demonstrates keeping the first occurrence and dropping duplicates based on specific columns.

python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 25, 35, 40],
        'City': ['NY', 'LA', 'NY', 'LA', 'SF']}

# Create DataFrame
df = pd.DataFrame(data)

# Remove duplicate rows keeping the first occurrence
unique_df = df.drop_duplicates()

# Remove duplicates based on 'Name' column, keep last occurrence
unique_name_df = df.drop_duplicates(subset=['Name'], keep='last')

print('Original DataFrame:')
print(df)
print('\nDataFrame after drop_duplicates():')
print(unique_df)
print('\nDataFrame after drop_duplicates(subset=["Name"], keep="last"):')
print(unique_name_df)
Output
Original DataFrame:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2    Alice   25   NY
3      Bob   35   LA
4  Charlie   40   SF

DataFrame after drop_duplicates():
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
3      Bob   35   LA
4  Charlie   40   SF

DataFrame after drop_duplicates(subset=["Name"], keep="last"):
      Name  Age City
2    Alice   25   NY
3      Bob   35   LA
4  Charlie   40   SF
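
subset also accepts more than one column. As a rough sketch using the same sample data, the snippet below treats rows as duplicates only when both Name and City match. The call to duplicated(), the companion method that returns a boolean mask, is added here purely to preview which rows would be dropped; it is not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 25, 35, 40],
                   'City': ['NY', 'LA', 'NY', 'LA', 'SF']})

# Preview: True marks rows that repeat an earlier (Name, City) pair
mask = df.duplicated(subset=['Name', 'City'])
print(mask)

# Drop those rows, keeping the first occurrence of each pair
by_name_city = df.drop_duplicates(subset=['Name', 'City'])
print(by_name_city)
```

Rows 2 and 3 repeat an earlier Name and City pair, so only rows 0, 1, and 4 remain.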

Common Pitfalls

Common mistakes when using drop_duplicates() include:

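  • Omitting subset when duplicates should be judged on certain columns only. By default every column is compared, so rows that match on the key columns but differ elsewhere are all kept.
  • Forgetting that drop_duplicates() returns a new DataFrame and leaves the original unchanged unless inplace=True is set.
  • Misunderstanding the keep parameter: 'first' and 'last' keep one row from each group of duplicates, while False removes every row in the group.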
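
The subset pitfall can be illustrated with a short sketch. The data and the column names user_id and login are hypothetical, chosen only to show rows that share a key but differ in another column.

```python
import pandas as pd

# Hypothetical data: user 1 appears twice with different login dates
df = pd.DataFrame({'user_id': [1, 1, 2],
                   'login': ['2024-01-01', '2024-01-02', '2024-01-01']})

# No subset: all columns are compared, the rows differ in 'login',
# so nothing is dropped
all_columns = df.drop_duplicates()

# subset=['user_id']: one row per user, keeping the last row listed for each
per_user = df.drop_duplicates(subset=['user_id'], keep='last')

print(len(all_columns))
print(per_user)
```

Without subset all three rows survive; with subset=['user_id'] only one row per user remains.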

Example of a common mistake and the correct way:

python
import pandas as pd

data = {'A': [1, 2, 2, 3], 'B': [4, 5, 5, 6]}
df = pd.DataFrame(data)

# Wrong: expecting original df to change but it doesn't
wrong = df.drop_duplicates()
print('Original df after drop_duplicates() without inplace:')
print(df)

# Right: use inplace=True to modify the original df
# (reassigning with df = df.drop_duplicates() also works)
df.drop_duplicates(inplace=True)
print('\nOriginal df after drop_duplicates(inplace=True):')
print(df)
Output
Original df after drop_duplicates() without inplace:
   A  B
0  1  4
1  2  5
2  2  5
3  3  6

Original df after drop_duplicates(inplace=True):
   A  B
0  1  4
1  2  5
3  3  6
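
A related slip is assigning the result of an inplace=True call. The sketch below assumes the documented behavior that drop_duplicates() returns None when inplace=True, and shows plain reassignment as an alternative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': [4, 5, 5, 6]})

# Pitfall: with inplace=True the method is documented to return None,
# so assigning the result loses the data (a copy is used here to leave df intact)
result = df.copy().drop_duplicates(inplace=True)
print(result)

# Alternative: skip inplace and reassign the returned DataFrame
df = df.drop_duplicates()
print(df)
```

Reassignment keeps the call chainable and avoids accidentally overwriting a variable with None.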

Quick Reference

Here is a quick summary of key parameters for drop_duplicates():

Parameter    | Description                                                              | Default
subset       | Columns to consider for identifying duplicates                           | None (all columns)
keep         | Which duplicates to keep: 'first', 'last', or False (drop all duplicates) | 'first'
inplace      | Modify the DataFrame in place if True                                    | False
ignore_index | Reset index in the returned DataFrame if True                            | False

Key Takeaways

Use drop_duplicates() to remove duplicate rows from a pandas DataFrame easily.
Specify subset to check duplicates only on certain columns.
Remember drop_duplicates() returns a new DataFrame unless inplace=True is set.
Use the keep parameter to control which duplicate row is kept: 'first', 'last', or False for none.
Use ignore_index=True to reset the index after dropping duplicates.