Duplicates on specific columns in Pandas - Mini Project: Build & Apply

Detecting Duplicates on Specific Columns with pandas

📖 Scenario: You work in a retail company. You have a list of sales records. Sometimes, the same customer buys the same product more than once. You want to find these repeated purchases by checking duplicates only on the CustomerID and ProductID columns.

🎯 Goal: Build a small program that creates a sales data table, sets the columns to check for duplicates, finds the duplicate rows based on those columns, and prints the duplicate rows.

📋 What You'll Learn

Create a pandas DataFrame with sales data including CustomerID, ProductID, and Quantity columns.

Create a list variable with the column names CustomerID and ProductID to check duplicates on.

Use the pandas DataFrame.duplicated() method with the subset parameter to find duplicates based on those columns.

Print the duplicate rows from the DataFrame.

💡 Why This Matters

🌍 Real World

Retail companies often want to find repeated purchases by the same customer for the same product to analyze buying patterns or detect errors.

💼 Career

Data analysts and data scientists frequently use pandas to clean and analyze data, including finding duplicates based on specific columns.

Create the sales data DataFrame

Import pandas as pd. Create a DataFrame called sales with the columns CustomerID, ProductID, and Quantity, containing exactly these rows in this order, each given as (CustomerID, ProductID, Quantity): (1, 101, 2), (2, 102, 1), (1, 101, 3), (3, 103, 5), (2, 102, 2).

import pandas as pd
# Create the sales DataFrame with the exact data
# Your code here

Hint: Use pd.DataFrame with a dictionary where keys are column names and values are lists of column values.
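The data for this step is listed as (CustomerID, ProductID, Quantity) tuples, so a row-oriented construction is also possible alongside the dictionary approach from the hint. The following is a minimal sketch, not the required solution; the variable name rows is an assumption introduced only for illustration.

```python
import pandas as pd

# Each tuple is one sale: (CustomerID, ProductID, Quantity).
# `rows` is an illustrative name, not part of the exercise.
rows = [(1, 101, 2), (2, 102, 1), (1, 101, 3), (3, 103, 5), (2, 102, 2)]

# Build the DataFrame from the list of rows and name the columns explicitly.
sales = pd.DataFrame(rows, columns=['CustomerID', 'ProductID', 'Quantity'])

print(sales.shape)  # (5, 3)
```

Both constructions should produce the same table; the dictionary form reads column by column, while this form mirrors the way the rows are written in the instructions.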

Set the columns to check for duplicates

Create a list variable called cols_to_check that contains the strings 'CustomerID' and 'ProductID'.

import pandas as pd
sales = pd.DataFrame({
    'CustomerID': [1, 2, 1, 3, 2],
    'ProductID': [101, 102, 101, 103, 102],
    'Quantity': [2, 1, 3, 5, 2]
})
# Create the list cols_to_check with 'CustomerID' and 'ProductID'
# Your code here

Hint: Just create a list with the two column names as strings.

Find duplicate rows based on specific columns

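Create a variable called duplicates that holds every row of sales whose CustomerID and ProductID combination appears more than once. Call sales.duplicated(subset=cols_to_check, keep=False) to get a boolean mask that is True for every occurrence of a repeated combination (the default, keep='first', would leave the first occurrence unmarked), then use that mask to filter sales.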

import pandas as pd
sales = pd.DataFrame({
    'CustomerID': [1, 2, 1, 3, 2],
    'ProductID': [101, 102, 101, 103, 102],
    'Quantity': [2, 1, 3, 5, 2]
})
cols_to_check = ['CustomerID', 'ProductID']
# Find duplicate rows based on cols_to_check and store in duplicates
# Your code here

Hint: Use sales.duplicated() with subset=cols_to_check and keep=False to mark all duplicates, then filter sales with that boolean mask.
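To see why this step uses keep=False, it may help to compare the three values the keep parameter accepts. The sketch below reuses the sales data from this project; the names mask_first, mask_last, and mask_all are illustrative and not part of the exercise.

```python
import pandas as pd

sales = pd.DataFrame({
    'CustomerID': [1, 2, 1, 3, 2],
    'ProductID': [101, 102, 101, 103, 102],
    'Quantity': [2, 1, 3, 5, 2]
})
cols_to_check = ['CustomerID', 'ProductID']

# keep='first' (the default): the first occurrence is NOT flagged.
mask_first = sales.duplicated(subset=cols_to_check, keep='first')
# keep='last': the last occurrence is NOT flagged.
mask_last = sales.duplicated(subset=cols_to_check, keep='last')
# keep=False: every row whose key appears more than once is flagged.
mask_all = sales.duplicated(subset=cols_to_check, keep=False)

print(mask_first.tolist())  # [False, False, True, False, True]
print(mask_last.tolist())   # [True, True, False, False, False]
print(mask_all.tolist())    # [True, True, True, False, True]
```

Only keep=False flags both purchases in each repeated pair, which is what the scenario asks for. Quantity differs between the repeated rows, but it is ignored because it is not in subset.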

Print the duplicate rows

Write a print statement to display the duplicates DataFrame.

import pandas as pd
sales = pd.DataFrame({
    'CustomerID': [1, 2, 1, 3, 2],
    'ProductID': [101, 102, 101, 103, 102],
    'Quantity': [2, 1, 3, 5, 2]
})
cols_to_check = ['CustomerID', 'ProductID']
duplicates = sales[sales.duplicated(subset=cols_to_check, keep=False)]
# Print the duplicates DataFrame
# Your code here

Hint: Just use print(duplicates) to show the duplicate rows.
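Put together, the four steps can be sketched as one short script. This is one possible assembly of the pieces shown above rather than the only valid answer; the two extra print calls at the end are additions for checking the result and are not required by the exercise.

```python
import pandas as pd

# Step 1: the sales data.
sales = pd.DataFrame({
    'CustomerID': [1, 2, 1, 3, 2],
    'ProductID': [101, 102, 101, 103, 102],
    'Quantity': [2, 1, 3, 5, 2]
})

# Step 2: the columns that define a "repeated purchase".
cols_to_check = ['CustomerID', 'ProductID']

# Step 3: keep every row whose (CustomerID, ProductID) pair occurs more than once.
duplicates = sales[sales.duplicated(subset=cols_to_check, keep=False)]

# Step 4: show the duplicate rows.
print(duplicates)

# Extra checks (not part of the exercise): which original rows were kept.
print(duplicates.index.tolist())  # [0, 1, 2, 4]
print(len(duplicates))            # 4
```

Row 3 (customer 3, product 103) is the only purchase that is not repeated, so it is the only row missing from the result. The original index labels are preserved, which makes it easy to trace each duplicate back to its position in sales.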