Duplicates on specific columns in Pandas - Mini Project: Build & Apply

Detecting Duplicates on Specific Columns with pandas
📖 Scenario: You work in a retail company. You have a list of sales records. Sometimes, the same customer buys the same product more than once. You want to find these repeated purchases by checking duplicates only on the CustomerID and ProductID columns.
🎯 Goal: Build a small program that creates a sales data table, sets the columns to check for duplicates, finds the duplicate rows based on those columns, and prints the duplicate rows.
📋 What You'll Learn
Create a pandas DataFrame with sales data including CustomerID, ProductID, and Quantity columns.
Create a list variable with the column names CustomerID and ProductID to check duplicates on.
Use the pandas duplicated() method with the subset parameter to find duplicates based on those columns.
Print the duplicate rows from the DataFrame.
💡 Why This Matters
🌍 Real World
Retail companies often want to find repeated purchases by the same customer for the same product to analyze buying patterns or detect errors.
💼 Career
Data analysts and data scientists frequently use pandas to clean and analyze data, including finding duplicates based on specific columns.
1. Create the sales data DataFrame
Import pandas as pd. Create a DataFrame called sales with the columns CustomerID, ProductID, and Quantity. Use exactly these rows, each given as (CustomerID, ProductID, Quantity): (1, 101, 2), (2, 102, 1), (1, 101, 3), (3, 103, 5), (2, 102, 2).
Need a hint?

Use pd.DataFrame with a dictionary where keys are column names and values are lists of column values.
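One way to write this step is sketched below. It builds the DataFrame column by column from a dictionary, so each list holds the values of one column in row order. The print call at the end is not required by the step; it is only there so you can check the table.

```python
import pandas as pd

# Each key is a column name; each list holds that column's values,
# in the same order as the rows given in the step description.
sales = pd.DataFrame({
    "CustomerID": [1, 2, 1, 3, 2],
    "ProductID": [101, 102, 101, 103, 102],
    "Quantity": [2, 1, 3, 5, 2],
})

# Optional check: show the five rows that were created.
print(sales)
```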

2
Set the columns to check for duplicates
Create a list variable called cols_to_check that contains the strings 'CustomerID' and 'ProductID'.
Need a hint?

Just create a list with the two column names as strings.
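A minimal sketch of this step follows. The names must match the DataFrame's column labels exactly, including capitalization, because they will be passed to the subset parameter later.

```python
# Column labels that together define "the same purchase":
# the same customer buying the same product.
cols_to_check = ["CustomerID", "ProductID"]
```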

3. Find duplicate rows based on specific columns
Create a variable called duplicates that stores the rows of sales whose CustomerID and ProductID combination appears more than once. sales.duplicated(subset=cols_to_check, keep=False) returns a boolean Series that is True for every row in a repeated group, so use it as a mask to filter sales.
Need a hint?

Use sales.duplicated() with subset=cols_to_check and keep=False to mark all duplicates, then filter sales with that boolean mask.
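The sketch below shows one way to do this, and also why keep=False matters. It rebuilds the sales table and cols_to_check from the earlier steps so it can run on its own. The names mask_all and mask_first are illustrative only and are not part of the exercise.

```python
import pandas as pd

sales = pd.DataFrame({
    "CustomerID": [1, 2, 1, 3, 2],
    "ProductID": [101, 102, 101, 103, 102],
    "Quantity": [2, 1, 3, 5, 2],
})
cols_to_check = ["CustomerID", "ProductID"]

# keep=False marks every row that belongs to a repeated
# (CustomerID, ProductID) group, including the first occurrence.
mask_all = sales.duplicated(subset=cols_to_check, keep=False)

# For comparison: the default keep="first" leaves the first occurrence
# unmarked and flags only the later repeats.
mask_first = sales.duplicated(subset=cols_to_check)

# Filter the DataFrame with the boolean mask to get the duplicate rows.
duplicates = sales[mask_all]

print(mask_all.tolist())    # [True, True, True, False, True]
print(mask_first.tolist())  # [False, False, True, False, True]
```

Quantity is ignored in the comparison, so rows with different quantities still count as duplicates as long as the customer and product match.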

4. Print the duplicate rows
Write a print statement to display the duplicates DataFrame.
Need a hint?

Just use print(duplicates) to show the duplicate rows.
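Putting the four steps together, the finished program might look like the sketch below. With the data from step 1, the printed result should contain the rows at index 0, 1, 2, and 4; the row for customer 3 is left out because that customer and product pair appears only once.

```python
import pandas as pd

# Step 1: the sales data.
sales = pd.DataFrame({
    "CustomerID": [1, 2, 1, 3, 2],
    "ProductID": [101, 102, 101, 103, 102],
    "Quantity": [2, 1, 3, 5, 2],
})

# Step 2: the columns that define a duplicate.
cols_to_check = ["CustomerID", "ProductID"]

# Step 3: keep every row whose (CustomerID, ProductID) pair repeats.
duplicates = sales[sales.duplicated(subset=cols_to_check, keep=False)]

# Step 4: display the duplicate rows.
print(duplicates)
```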