Finding duplicates on specific columns helps you spot repeated information in just those parts of your data, which is useful for cleaning or analyzing data more accurately.
Duplicates on specific columns in Pandas
DataFrame.duplicated(subset=[column_names], keep='first')
subset lets you pick which columns to check for duplicates.
keep controls which duplicates to mark: 'first' keeps the first occurrence, 'last' keeps the last, and False marks all duplicates.
df.duplicated(subset=['email'])
df.duplicated(subset=['date', 'location'], keep=False)
df.duplicated(subset=['product_id'], keep='last')
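As a minimal sketch of how the three keep options differ, consider a tiny frame with a single email column (the column name and values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'email': ['a@x.com', 'b@x.com', 'a@x.com']})

# keep='first' (default): only later occurrences are marked
print(df.duplicated(subset=['email']).tolist())               # [False, False, True]

# keep='last': only earlier occurrences are marked
print(df.duplicated(subset=['email'], keep='last').tolist())  # [True, False, False]

# keep=False: every member of a duplicate group is marked
print(df.duplicated(subset=['email'], keep=False).tolist())   # [True, False, True]
```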
This code creates a small table of orders. It checks for repeated orders by the same customer for the same product. It adds a new column showing which rows are duplicates (True) and which are not (False).
import pandas as pd

data = {
    'order_id': [1, 2, 3, 4, 5, 6],
    'customer': ['Alice', 'Bob', 'Alice', 'Bob', 'Alice', 'Charlie'],
    'product': ['Book', 'Pen', 'Book', 'Pen', 'Notebook', 'Book']
}
df = pd.DataFrame(data)

# Find duplicates based on 'customer' and 'product' columns
duplicates = df.duplicated(subset=['customer', 'product'], keep='first')

# Show the original data with a new column marking duplicates
df['is_duplicate'] = duplicates
print(df)
Duplicates are checked row by row based on the selected columns only.
Setting keep=False marks all duplicates as True, including the first occurrence.
You can use the boolean result to filter or remove duplicates easily.
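As a sketch of that filtering step, assuming a frame with customer and product columns like the example above, the boolean mask can drop repeated rows directly, and drop_duplicates(subset=...) is the equivalent shortcut:

```python
import pandas as pd

df = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice'],
    'product': ['Book', 'Pen', 'Book']
})

# Boolean mask: True for rows that repeat an earlier (customer, product) pair
mask = df.duplicated(subset=['customer', 'product'])

# Invert the mask to keep only the first occurrence of each pair
unique_rows = df[~mask]

# Equivalent built-in shortcut
same_result = df.drop_duplicates(subset=['customer', 'product'])
print(unique_rows)
```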
Use duplicated() with subset to find duplicates on specific columns.
The keep parameter controls which duplicates are marked.
This helps clean or analyze data where only some columns matter for duplication.