
How to Remove Duplicate Rows in a pandas DataFrame

Use the drop_duplicates() method on a pandas DataFrame to remove duplicate rows. Pass subset to limit the comparison to specific columns, and pass keep to choose whether the first occurrence, the last occurrence, or none of the duplicated rows is retained.
📐

Syntax

The drop_duplicates() method removes duplicate rows from a DataFrame.

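  • subset: Column label or list of labels to consider when identifying duplicates. Default is all columns.
  • keep: Which occurrence to keep within each group of duplicates: 'first' (default), 'last', or False to drop every row that has a duplicate.
  • inplace: If True, modifies the original DataFrame and returns None; otherwise returns a new DataFrame.
  • ignore_index: If True, relabels the result's index as 0, 1, ..., n-1. Default is False.
python
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)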
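The three keep settings are easiest to compare side by side. The following is a minimal sketch on a small made-up DataFrame (the data is illustrative only); it prints the index labels that survive under each setting.

```python
import pandas as pd

# Illustrative data: rows 0 and 2 are identical
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David"],
    "Age": [25, 30, 25, 40],
})

# keep='first' (default): retain the first row of each duplicate group
first = df.drop_duplicates()
# keep='last': retain the last row of each duplicate group
last = df.drop_duplicates(keep="last")
# keep=False: discard every row that has a duplicate
none = df.drop_duplicates(keep=False)

print(first.index.tolist())  # [0, 1, 3]
print(last.index.tolist())   # [1, 2, 3]
print(none.index.tolist())   # [1, 3]
```

Note that keep=False removes both copies of the Alice row, so it is the setting to use when only rows that were unique from the start should remain.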
💻

Example

This example shows how to remove duplicate rows from a DataFrame. It demonstrates removing duplicates based on all columns and based on a specific column.

python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30],
        'City': ['NY', 'LA', 'NY', 'Chicago', 'LA']}

df = pd.DataFrame(data)

# Remove duplicates considering all columns, keep first occurrence
unique_all = df.drop_duplicates()

# Remove duplicates based on 'Name' column, keep last occurrence
unique_name_last = df.drop_duplicates(subset=['Name'], keep='last')

print('Original DataFrame:')
print(df)
print('\nAfter removing duplicates (all columns):')
print(unique_all)
print('\nAfter removing duplicates based on Name (keep last):')
print(unique_name_last)
Output
Original DataFrame:
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
2  Alice   25       NY
3  David   40  Chicago
4    Bob   30       LA

After removing duplicates (all columns):
    Name  Age     City
0  Alice   25       NY
1    Bob   30       LA
3  David   40  Chicago

After removing duplicates based on Name (keep last):
    Name  Age     City
2  Alice   25       NY
3  David   40  Chicago
4    Bob   30       LA
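The surviving rows keep their original index labels, so the result can have gaps in its index. If a clean 0-based index is wanted, one option is the ignore_index parameter, sketched below on the same data. This assumes pandas 1.0 or later, where that parameter is available; on older versions, chaining .reset_index(drop=True) gives the same result.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Alice", "David", "Bob"],
    "Age": [25, 30, 25, 40, 30],
    "City": ["NY", "LA", "NY", "Chicago", "LA"],
})

# Keep the last row per Name and renumber the index from 0
compact = df.drop_duplicates(subset=["Name"], keep="last", ignore_index=True)

# Equivalent approach that also works on older pandas versions
compact_alt = df.drop_duplicates(subset=["Name"], keep="last").reset_index(drop=True)

print(compact.index.tolist())       # [0, 1, 2]
print(compact["Name"].tolist())     # ['Alice', 'David', 'Bob']
print(compact.equals(compact_alt))  # True
```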
⚠️

Common Pitfalls

Common mistakes when removing duplicates include:

  • Not specifying subset when you want to check duplicates only on certain columns.
  • Forgetting that drop_duplicates() returns a new DataFrame and leaves the original unchanged unless inplace=True is set.
  • Misunderstanding the keep parameter, which controls which occurrence in each group of duplicates is retained.
python
import pandas as pd

data = {'A': [1, 1, 2], 'B': [3, 3, 4]}
df = pd.DataFrame(data)

# Wrong: drop_duplicates called but result not saved or inplace not set
# This does NOT remove duplicates from df

df.drop_duplicates()
print('DataFrame after drop_duplicates without assignment:')
print(df)

# Right: assign back or use inplace=True

df_clean = df.drop_duplicates()
print('\nDataFrame after drop_duplicates with assignment:')
print(df_clean)
Output
DataFrame after drop_duplicates without assignment:
   A  B
0  1  3
1  1  3
2  2  4

DataFrame after drop_duplicates with assignment:
   A  B
0  1  3
2  2  4
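A related trap is combining inplace=True with assignment. With inplace=True the method modifies the DataFrame and returns None, so writing df = df.drop_duplicates(inplace=True) would replace df with None. A short sketch of the in-place behavior, using the same small DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [3, 3, 4]})

# inplace=True changes df itself; the call returns None
result = df.drop_duplicates(inplace=True)

print(result)   # None
print(len(df))  # 2
```

Choose one style per call: either assign the returned DataFrame, or use inplace=True and ignore the return value.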
📊

Quick Reference

Summary of drop_duplicates() parameters:

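Parameter    | Description                                                                    | Default
subset       | Columns to consider for identifying duplicates                                 | None (all columns)
keep         | Which occurrence to keep: 'first', 'last', or False (drop all duplicated rows) | 'first'
inplace      | Modify the original DataFrame if True (the call then returns None)             | False
ignore_index | Relabel the result's index as 0, 1, ..., n-1 if True                           | False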

Key Takeaways

  • Use df.drop_duplicates() to remove duplicate rows from a pandas DataFrame.
  • Specify subset to check for duplicates on specific columns only.
  • Remember that drop_duplicates() returns a new DataFrame unless inplace=True is set.
  • Use keep='first' or keep='last' to control which duplicate row remains, or keep=False to drop all of them.
  • Check your DataFrame after removing duplicates to confirm the changes.
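One way to do that final check is sketched below. It relies on DataFrame.duplicated(), the companion pandas method that flags duplicate rows with a boolean Series; the data and variable names are illustrative only.

```python
import pandas as pd

# Illustrative data: rows 1 and 3 repeat earlier rows
df = pd.DataFrame({"A": [1, 1, 2, 2, 2], "B": [3, 3, 4, 4, 5]})

# Count rows flagged as repeats of an earlier row
dupes_before = int(df.duplicated().sum())

clean = df.drop_duplicates()
removed = len(df) - len(clean)

print(dupes_before)              # 2
print(removed)                   # 2
print(clean.duplicated().any())  # False
```

If removed does not match the count you expected, check whether subset or keep needs to be set differently.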