How to Use drop_duplicates in pandas to Remove Duplicate Rows

Use drop_duplicates() in pandas to remove duplicate rows from a DataFrame. You can restrict the comparison to specific columns with the subset parameter, and use the keep parameter to retain the first occurrence, the last occurrence, or none of the duplicated rows.

📐

Syntax

The basic syntax of drop_duplicates() is:

python

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

subset: Column label or list of labels to consider when identifying duplicates. Default is all columns.
keep: Which occurrence to keep: 'first' (default), 'last', or False to drop every row that has a duplicate.
inplace: If True, modifies the DataFrame in place and returns None instead of a new DataFrame.
ignore_index: If True, relabels the index of the result as 0, 1, ..., n-1.
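To see how the three keep values differ, here is a minimal sketch on a made-up DataFrame. The column names and values are illustrative assumptions, not taken from a real dataset:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are identical, row 1 is unique
df = pd.DataFrame({'id': [1, 2, 1], 'score': [10, 20, 10]})

first = df.drop_duplicates(keep='first')        # keeps row 0, drops row 2
last = df.drop_duplicates(keep='last')          # keeps row 2, drops row 0
unique_only = df.drop_duplicates(keep=False)    # drops both row 0 and row 2

print(first.index.tolist())        # [0, 1]
print(last.index.tolist())         # [1, 2]
print(unique_only.index.tolist())  # [1]
```

Note that keep=False does not mean "keep nothing": rows that were never duplicated, such as row 1 here, still survive.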

💻

Example

This example removes duplicate rows from a DataFrame using drop_duplicates(). It first drops fully identical rows, keeping the first occurrence, and then drops rows that repeat a value in the Name column, keeping the last occurrence.

python

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 25, 35, 40],
        'City': ['NY', 'LA', 'NY', 'LA', 'SF']}

# Create DataFrame
df = pd.DataFrame(data)

# Remove duplicate rows keeping the first occurrence
unique_df = df.drop_duplicates()

# Remove duplicates based on 'Name' column, keep last occurrence
unique_name_df = df.drop_duplicates(subset=['Name'], keep='last')

print('Original DataFrame:')
print(df)
print('\nDataFrame after drop_duplicates():')
print(unique_df)
print('\nDataFrame after drop_duplicates(subset=["Name"], keep="last"):')
print(unique_name_df)

Output

Original DataFrame:
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
2    Alice   25   NY
3      Bob   35   LA
4  Charlie   40   SF

DataFrame after drop_duplicates():
      Name  Age City
0    Alice   25   NY
1      Bob   30   LA
3      Bob   35   LA
4  Charlie   40   SF

DataFrame after drop_duplicates(subset=["Name"], keep="last"):
      Name  Age City
2    Alice   25   NY
3      Bob   35   LA
4  Charlie   40   SF
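In this output the surviving rows keep their original labels, so the index has gaps (0, 1, 3, 4). If contiguous labels are preferred, ignore_index=True is one way to get them. A small sketch reusing the same data, assuming pandas 1.0 or later (where this parameter is available):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 25, 35, 40],
                   'City': ['NY', 'LA', 'NY', 'LA', 'SF']})

# Default: surviving rows keep their original labels
kept_labels = df.drop_duplicates()
print(kept_labels.index.tolist())   # [0, 1, 3, 4]

# ignore_index=True relabels the result 0..n-1
relabeled = df.drop_duplicates(ignore_index=True)
print(relabeled.index.tolist())     # [0, 1, 2, 3]
```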

⚠️

Common Pitfalls

Common mistakes when using drop_duplicates() include:

Not specifying subset when you want to check duplicates only on certain columns. By default all columns are compared, so rows that match on your key columns but differ elsewhere are not removed.
Forgetting that drop_duplicates() returns a new DataFrame unless inplace=True is set.
Misunderstanding the keep parameter: 'first' and 'last' keep one row from each group of duplicates, while False removes every row in the group.

Example of a common mistake and the correct way:

python

import pandas as pd

data = {'A': [1, 2, 2, 3], 'B': [4, 5, 5, 6]}
df = pd.DataFrame(data)

# Wrong: the result is returned as a new DataFrame, so df itself is unchanged
wrong = df.drop_duplicates()
print('Original df after drop_duplicates() without inplace:')
print(df)

# Right: use inplace=True to modify df itself
# (reassigning with df = df.drop_duplicates() has the same effect)
df.drop_duplicates(inplace=True)
print('\nOriginal df after drop_duplicates(inplace=True):')
print(df)

Output

Original df after drop_duplicates() without inplace:
   A  B
0  1  4
1  2  5
2  2  5
3  3  6

Original df after drop_duplicates(inplace=True):
   A  B
0  1  4
1  2  5
3  3  6
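The subset pitfall can be illustrated in the same way. In this sketch the email and signup_day columns are hypothetical: two rows share an email but differ in another column, so the default all-column comparison keeps both.

```python
import pandas as pd

# Hypothetical data: the same email appears twice with different signup days
df = pd.DataFrame({'email': ['a@x.com', 'a@x.com', 'b@x.com'],
                   'signup_day': [1, 2, 3]})

# All columns are compared, so no row counts as a duplicate
all_cols = df.drop_duplicates()
print(len(all_cols))      # 3

# Only 'email' is compared, so the second a@x.com row is dropped
by_email = df.drop_duplicates(subset=['email'])
print(len(by_email))      # 2

# With inplace=True the method returns None, so do not assign the result
result = df.drop_duplicates(subset=['email'], inplace=True)
print(result is None)     # True
print(len(df))            # 2
```

Writing df = df.drop_duplicates(inplace=True) is a related slip: it would replace df with None.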

📊

Quick Reference

Here is a quick summary of key parameters for drop_duplicates():

Parameter	Description	Default
subset	Columns to consider for identifying duplicates	None (all columns)
keep	Which occurrence to keep: 'first', 'last', or False (drop every row that has a duplicate)	'first'
inplace	Modify the DataFrame in place if True	False
ignore_index	Reset index in the returned DataFrame if True	False

✅

Key Takeaways

Use drop_duplicates() to remove duplicate rows from a pandas DataFrame easily.

Specify subset to check duplicates only on certain columns.

Remember drop_duplicates() returns a new DataFrame unless inplace=True is set.

Use the keep parameter to control which duplicate row to keep.

Use ignore_index=True to reset the index after dropping duplicates.