
Duplicates on specific columns in Pandas

Introduction

Finding duplicates on specific columns lets you spot repeated information in just those parts of your data, ignoring the other columns. This is useful for cleaning or analyzing data more accurately.

You want to check if customers have multiple orders with the same product.
You need to find repeated entries based on email addresses in a contact list.
You want to remove duplicate rows that share the same date and location in event data.
You want to count how many times a specific combination of columns appears in your dataset.
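As a rough sketch of the last use case, counting how often each combination appears can be done with groupby(...).size(). The table and its column names here are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Alice'],
    'product': ['Book', 'Pen', 'Book', 'Pen'],
})

# Count how many rows share each (customer, product) combination
counts = df.groupby(['customer', 'product']).size()
print(counts[('Alice', 'Book')])  # 2

# Keep only the combinations that occur more than once
repeated = counts[counts > 1]
print(repeated.index.tolist())  # [('Alice', 'Book')]
```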
Syntax
Pandas
DataFrame.duplicated(subset=[column_names], keep='first')

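subset lets you pick which columns to check for duplicates. If you leave it out, all columns are compared.

keep controls which occurrences are marked True: 'first' (the default) marks every occurrence except the first, 'last' marks every occurrence except the last, and False marks all of them.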
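To make the three keep settings concrete, here is a minimal sketch that runs each one on the same data. The contact list is hypothetical; note that the 'name' column differs between rows but is ignored because only 'email' is in subset.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'email': ['a@x.com', 'b@x.com', 'a@x.com', 'a@x.com'],
    'name': ['Ann', 'Ben', 'Ann', 'Anna'],
})

# Same subset, three different keep settings
first = df.duplicated(subset=['email'], keep='first')
last = df.duplicated(subset=['email'], keep='last')
all_marked = df.duplicated(subset=['email'], keep=False)

print(first.tolist())       # [False, False, True, True]
print(last.tolist())        # [True, False, True, False]
print(all_marked.tolist())  # [True, False, True, True]
```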

Examples
Find duplicates based on the 'email' column, marking all but the first occurrence as duplicates.
Pandas
df.duplicated(subset=['email'])
Mark all rows that have duplicate combinations of 'date' and 'location', including the first occurrences.
Pandas
df.duplicated(subset=['date', 'location'], keep=False)
Mark duplicates based on 'product_id', leaving only the last occurrence of each value unmarked.
Pandas
df.duplicated(subset=['product_id'], keep='last')
Sample Program

This code creates a small table of orders. It checks for repeated orders by the same customer for the same product. It adds a new column showing which rows are duplicates (True) and which are not (False).

Pandas
import pandas as pd

data = {
    'order_id': [1, 2, 3, 4, 5, 6],
    'customer': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice', 'Charlie'],
    'product': ['Book', 'Pen', 'Book', 'Pen', 'Notebook', 'Book']
}
df = pd.DataFrame(data)

# Find duplicates based on 'customer' and 'product' columns
duplicates = df.duplicated(subset=['customer', 'product'], keep='first')

# Show the original data with a new column marking duplicates
df['is_duplicate'] = duplicates
print(df)
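Output: the rows with order_id 3 and 4 are marked True in is_duplicate, because they repeat the customer and product of orders 1 and 2. All other rows are marked False.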
Important Notes

Rows are compared only on the selected columns; values in the other columns are ignored.

Setting keep=False marks all duplicates as True, including the first occurrence.

You can use the boolean result to filter or remove duplicates easily.
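A minimal sketch of that filtering step, using a hypothetical orders table: the boolean mask selects the repeated rows, its negation removes them, and drop_duplicates with the same subset gives the same result in one call.

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer': ['Alice', 'Bob', 'Alice', 'Bob'],
    'product': ['Book', 'Pen', 'Book', 'Ink'],
})

# True for every repeat of a (customer, product) pair after the first
mask = df.duplicated(subset=['customer', 'product'])

repeats = df[mask]    # only the rows flagged as duplicates
deduped = df[~mask]   # everything else, first occurrences kept

# drop_duplicates removes the same rows in a single step
same = df.drop_duplicates(subset=['customer', 'product'])

print(repeats['order_id'].tolist())  # [3]
print(deduped['order_id'].tolist())  # [1, 2, 4]
print(deduped.equals(same))          # True
```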

Summary

Use duplicated() with subset to find duplicates on specific columns.

The keep parameter controls which occurrences are marked as duplicates: all but the first, all but the last, or all of them.

This helps clean or analyze data where only some columns matter for duplication.