How to Remove Duplicates in pandas DataFrames Easily
Use the `drop_duplicates()` method on a pandas DataFrame to remove duplicate rows. You can specify which columns to check for duplicates and choose to keep the first, the last, or no duplicates by setting the `subset` and `keep` parameters.
Syntax
The basic syntax to remove duplicates in pandas is:

```python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)
```

- `subset`: Column label or list of labels to consider when identifying duplicates. If None, all columns are used.
- `keep`: Which duplicates to keep. Options are 'first' (default), 'last', or False (drop all duplicates).
- `inplace`: If True, modifies the original DataFrame in place and returns None. Otherwise, returns a new DataFrame.
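To make the `keep` options concrete, here is a minimal sketch on a one-column frame (the column name `x` is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2]})

# keep='first' (default): keeps row 0, drops the later copy at row 1
print(df.drop_duplicates(keep='first')['x'].tolist())   # [1, 2]

# keep='last': keeps row 1, drops the earlier copy at row 0
print(df.drop_duplicates(keep='last')['x'].tolist())    # [1, 2]

# keep=False: drops BOTH copies of 1, keeping only values that were never duplicated
print(df.drop_duplicates(keep=False)['x'].tolist())     # [2]
```

With `keep='first'` and `keep='last'` the values are the same here; what differs is which original row (and index label) survives.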
Example
This example shows how to remove duplicate rows from a DataFrame. It keeps the first occurrence of each duplicate.
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)
print('Original DataFrame:')
print(df)

# Remove duplicates, keeping the first occurrence
clean_df = df.drop_duplicates()
print('\nDataFrame after removing duplicates:')
print(clean_df)
```
Output
```
Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
4    Bob   30       LA

DataFrame after removing duplicates:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
```
Common Pitfalls
Common mistakes when removing duplicates include:
- Not specifying `subset` when you want to check duplicates only on certain columns.
- Forgetting to set `inplace=True` (or to reassign the result) when you want to modify the original DataFrame.
- Using `keep=False` without realizing it drops every row that has a duplicate, keeping only rows that were already unique.
```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 28, 40, 30],
        'City': ['NY', 'LA', 'Boston', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# Wrong: checks all columns, so the second Alice row (different Age and City)
# is not treated as a duplicate
wrong = df.drop_duplicates()

# Right: specify subset to check duplicates only on 'Name'
right = df.drop_duplicates(subset=['Name'])

print('Wrong approach (checks all columns):')
print(wrong)
print('\nRight approach (checks only Name column):')
print(right)
```
Output
```
Wrong approach (checks all columns):
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   28   Boston
3  David   40  Chicago

Right approach (checks only Name column):
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
```
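One more detail worth knowing: the rows that survive keep their original index labels (note the gap at index 2 above). If you want a clean 0..n-1 index, a quick sketch using the `ignore_index` parameter (available in pandas 1.0+) or `reset_index`:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Alice', 'Bob'],
                   'Age': [25, 25, 30]})

# Default: surviving rows keep their old index labels
print(df.drop_duplicates().index.tolist())                          # [0, 2]

# ignore_index=True renumbers the result 0..n-1
print(df.drop_duplicates(ignore_index=True).index.tolist())         # [0, 1]

# Equivalent on older pandas versions: reset the index afterwards
print(df.drop_duplicates().reset_index(drop=True).index.tolist())   # [0, 1]
```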
Quick Reference
Here is a quick summary of drop_duplicates() parameters:
| Parameter | Description | Default |
|---|---|---|
| subset | Columns to consider for duplicates | None (all columns) |
| keep | Which duplicates to keep: 'first', 'last', or False (drop all) | 'first' |
| inplace | Modify original DataFrame if True | False |
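A small sketch of the `inplace` distinction from the table above: with `inplace=True` the call modifies the DataFrame itself and returns None, so do not assign its result back to a variable you plan to use:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 1, 2]})

# Default (inplace=False): returns a new DataFrame; the original is untouched
out = df.drop_duplicates()
print(len(df), len(out))      # 3 2

# inplace=True: modifies df itself and returns None
result = df.drop_duplicates(inplace=True)
print(result, len(df))        # None 2
```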
Key Takeaways
- Use `DataFrame.drop_duplicates()` to remove duplicate rows easily.
- Specify `subset` to check duplicates on specific columns only.
- Set `keep='first'` or `keep='last'` to control which occurrence survives.
- Use `inplace=True` to modify the original DataFrame directly.
- Remember that `keep=False` drops every row that has a duplicate, keeping only rows that were already unique.