Pandas · How-To · Beginner · 3 min read

How to Remove Duplicates in pandas DataFrames Easily

Use the drop_duplicates() method on a pandas DataFrame to remove duplicate rows. You can specify columns to check for duplicates and choose to keep the first, last, or no duplicates by setting the subset and keep parameters.

Syntax

The basic syntax to remove duplicates in pandas is:

python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

  • subset: List of columns to consider when identifying duplicates. If None, all columns are used.
  • keep: Which duplicates to keep. Options are 'first' (default), 'last', or False (drop all duplicates).
  • inplace: If True, modifies the original DataFrame and returns None. Otherwise, returns a new DataFrame.
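A quick sketch of how the keep options behave, using a small made-up DataFrame in which one row (Alice) appears twice:

```python
import pandas as pd

# Made-up DataFrame: the Alice row appears at index 0 and index 2
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# keep='last' retains the final occurrence of each duplicate
last = df.drop_duplicates(keep='last')
print(last)  # rows 1 (Bob) and 2 (Alice)

# keep=False drops every row that appears more than once
unique_only = df.drop_duplicates(keep=False)
print(unique_only)  # row 1 (Bob) only
```

Notice that keep='last' keeps one copy of each duplicate (the last one), while keep=False keeps no copies at all.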

Example

This example shows how to remove duplicate rows from a DataFrame. It keeps the first occurrence of each duplicate.

python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}

df = pd.DataFrame(data)

print('Original DataFrame:')
print(df)

# Remove duplicates keeping the first occurrence
clean_df = df.drop_duplicates()

print('\nDataFrame after removing duplicates:')
print(clean_df)
Output
Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
4    Bob   30       LA

DataFrame after removing duplicates:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
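One detail worth noting: drop_duplicates() keeps the original index labels, so the result has gaps (0, 1, 3 in the example above). If you want a clean 0-based index, you can chain the standard reset_index method, as in this sketch:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# drop_duplicates preserves the surviving rows' index labels (0, 1, 3 here)
clean_df = df.drop_duplicates()

# reset_index(drop=True) renumbers the rows 0, 1, 2 and discards the old labels
clean_df = clean_df.reset_index(drop=True)
print(clean_df)
```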

Common Pitfalls

Common mistakes when removing duplicates include:

  • Not specifying subset when you want to check duplicates only on certain columns.
  • Forgetting to set inplace=True if you want to modify the original DataFrame.
  • Using keep=False without realizing it removes all duplicates, leaving only unique rows.
python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 26, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}

df = pd.DataFrame(data)

# Wrong: checks all columns, so the two Alice rows (ages 25 and 26)
# both survive because they are not exact duplicates
wrong = df.drop_duplicates()

# Right: specify subset to check duplicates only on 'Name'
right = df.drop_duplicates(subset=['Name'])

print('Wrong approach (checks all columns):')
print(wrong)

print('\nRight approach (checks only Name column):')
print(right)
Output
Wrong approach (checks all columns):
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   26       NY
3  David   40  Chicago

Right approach (checks only Name column):
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago
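The keep=False pitfall is easy to demonstrate with a small made-up dataset in which two names repeat as exact duplicate rows:

```python
import pandas as pd

# Made-up data: the Alice and Bob rows are exact duplicates
data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}
df = pd.DataFrame(data)

# keep=False removes EVERY row that has a duplicate,
# so Alice and Bob disappear entirely, leaving only David
unique_rows = df.drop_duplicates(keep=False)
print(unique_rows)  # only the David row remains
```

If your goal was deduplication (one copy of each row), this is almost certainly not what you wanted; use the default keep='first' instead.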

Quick Reference

Here is a quick summary of drop_duplicates() parameters:

Parameter | Description                                                    | Default
subset    | Columns to consider for duplicates                             | None (all columns)
keep      | Which duplicates to keep: 'first', 'last', or False (drop all) | 'first'
inplace   | Modify original DataFrame if True                              | False
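A minimal sketch of the inplace behavior from the table, on a small made-up DataFrame:

```python
import pandas as pd

# Made-up DataFrame with one duplicated row (Alice)
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'Age': [25, 30, 25]})

# Without inplace=True, the result is returned and the original is untouched
df.drop_duplicates()
print(len(df))  # still 3 rows

# With inplace=True, df itself is modified and the call returns None
df.drop_duplicates(inplace=True)
print(len(df))  # now 2 rows
```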

Key Takeaways

  • Use DataFrame.drop_duplicates() to remove duplicate rows easily.
  • Specify subset to check duplicates on specific columns only.
  • Set keep='first' or keep='last' to control which copy of each duplicate remains.
  • Use inplace=True to modify the original DataFrame directly.
  • Remember that keep=False removes every row that has a duplicate, keeping only rows that were already unique.