What are duplicates on specific columns in Pandas?


Introduction

Checking for duplicates on specific columns flags rows whose values repeat in just those columns, even when the other columns differ. This is useful for cleaning or analyzing data more accurately. Typical situations:

You want to check if customers have multiple orders with the same product.

You need to find repeated entries based on email addresses in a contact list.

You want to remove duplicate rows that share the same date and location in event data.

You want to count how many times a specific combination of columns appears in your dataset.
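The last case, counting combinations, is not something duplicated() does by itself, since it only returns True or False per row. A minimal sketch of one way to get the counts, using groupby() and size() instead; the orders table and the names combo_counts and repeated are made up for illustration:

```python
import pandas as pd

# Hypothetical order data, invented for illustration
orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice', 'Alice'],
    'product': ['Book', 'Pen', 'Book', 'Notebook'],
})

# Number of rows for each (customer, product) combination
combo_counts = orders.groupby(['customer', 'product']).size()

# Keep only the combinations that appear more than once
repeated = combo_counts[combo_counts > 1]
print(repeated)
```

Here only the Alice and Book pair appears twice, so it is the only combination left in repeated.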

Syntax

Pandas

DataFrame.duplicated(subset=[column_names], keep='first')

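subset lets you pick which columns to check for duplicates. If you leave it out, all columns are compared.

keep controls which occurrence is left unmarked: 'first' (the default) marks every occurrence except the first as True, 'last' marks every occurrence except the last, and False marks every row that has a duplicate.

The result is a boolean Series with one value per row.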
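To compare the three keep settings side by side, here is a small sketch. The contacts table and its values are hypothetical, chosen only so that one email repeats:

```python
import pandas as pd

# Hypothetical contact list, invented for illustration
contacts = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Ann B.', 'Cal'],
    'email': ['a@example.com', 'b@example.com', 'a@example.com', 'c@example.com'],
})

# Same check on 'email', with each keep setting
keep_first = contacts.duplicated(subset=['email'], keep='first')
keep_last = contacts.duplicated(subset=['email'], keep='last')
keep_none = contacts.duplicated(subset=['email'], keep=False)

# Put the three results next to each other for comparison
comparison = pd.DataFrame({
    'email': contacts['email'],
    'keep_first': keep_first,
    'keep_last': keep_last,
    'keep_False': keep_none,
})
print(comparison)
```

The repeated address sits in rows 0 and 2. With keep='first' only row 2 is True, with keep='last' only row 0 is True, and with keep=False both are True.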

Examples

Find duplicates based on the 'email' column, marking all but the first occurrence as duplicates.

Pandas

df.duplicated(subset=['email'])

Mark all rows that have duplicate combinations of 'date' and 'location', including the first occurrences.

Pandas

df.duplicated(subset=['date', 'location'], keep=False)

Mark duplicates based on 'product_id', leaving the last occurrence of each value unmarked.

Pandas

df.duplicated(subset=['product_id'], keep='last')

Sample Program

This code creates a small table of orders. It checks for repeated orders by the same customer for the same product. It adds a new column showing which rows are duplicates (True) and which are not (False).

Pandas

import pandas as pd

data = {
    'order_id': [1, 2, 3, 4, 5, 6],
    'customer': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice', 'Charlie'],
    'product': ['Book', 'Pen', 'Book', 'Pen', 'Notebook', 'Book']
}
df = pd.DataFrame(data)

# Find duplicates based on 'customer' and 'product' columns
duplicates = df.duplicated(subset=['customer', 'product'], keep='first')

# Show the original data with a new column marking duplicates
df['is_duplicate'] = duplicates
print(df)

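Output

The is_duplicate column is True for order_id 3 and 4, because Alice's second Book order and Bob's second Pen order repeat an earlier customer and product pair. It is False for the other four rows.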

Important Notes

Only the columns listed in subset are compared. Values in the other columns are ignored.

Setting keep=False marks all duplicates as True, including the first occurrence.

You can use the boolean result to filter or remove duplicates easily.
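As a sketch of that last note, the boolean result can be inverted with ~ to keep only the unmarked rows, and drop_duplicates() with the same subset gives the same rows in one call. The events table below is hypothetical:

```python
import pandas as pd

# Hypothetical event data, invented for illustration
events = pd.DataFrame({
    'date': ['2024-05-01', '2024-05-01', '2024-05-02'],
    'location': ['Hall A', 'Hall A', 'Hall A'],
    'note': ['first entry', 're-entered', 'next day'],
})

# True for rows that repeat an earlier (date, location) pair
mask = events.duplicated(subset=['date', 'location'])

# Option 1: filter with the inverted boolean mask
filtered = events[~mask]

# Option 2: remove the same rows with drop_duplicates
deduped = events.drop_duplicates(subset=['date', 'location'])

print(filtered)
```

Both options drop the 're-entered' row, even though its note differs, because only date and location are compared.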

Summary

Use duplicated() with subset to find duplicates on specific columns.

The keep parameter controls which occurrence, if any, is left unmarked.

This helps clean or analyze data where only some columns matter for duplication.