duplicated() for finding duplicates in Pandas - Mini Project: Build & Apply

Using duplicated() to Find Duplicate Rows in Data

📖 Scenario: You work on a store's data team and have a table of sales records. Sometimes the same sale is recorded more than once by mistake. You want to find these duplicate records so the data can be fixed.

🎯 Goal: You will create a small sales data table, then use pandas duplicated() to find which rows are duplicates.

📋 What You'll Learn

Create a pandas DataFrame called sales with given sales data

Create a variable keep_option that decides which occurrence of a repeated row is left unmarked

Use duplicated() on sales with keep=keep_option to find duplicates

Print the boolean Series showing duplicate rows

💡 Why This Matters

🌍 Real World

Duplicate data can cause errors in reports and decisions. Finding duplicates helps keep data clean and trustworthy.

💼 Career

Data analysts and scientists often clean data by identifying and handling duplicates to ensure accurate analysis.

Create the sales DataFrame

Import pandas as pd and create a DataFrame called sales from exactly this dictionary, where each key is a column name and each list holds that column's values:

{'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple'], 'Price': [1.0, 0.5, 1.0, 0.5, 1.0], 'Quantity': [10, 5, 10, 5, 10]}

Pandas

import pandas as pd
# Create the sales DataFrame with the exact data
# Your code here

Need a hint?

Use pd.DataFrame() with a dictionary of lists for columns.
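To illustrate the hint, here is a minimal sketch of the dictionary-of-lists pattern. The table demo and its columns Item and Cost are hypothetical names chosen only for this illustration; they are not part of the exercise data.

```python
import pandas as pd

# Hypothetical two-column table, used only to illustrate the constructor.
demo = pd.DataFrame({
    'Item': ['Pen', 'Notebook'],  # each key becomes a column name
    'Cost': [1.5, 3.0],           # each list supplies that column's values
})

print(demo.shape)          # (number of rows, number of columns)
print(list(demo.columns))  # column names, in dictionary order
```

All lists must have the same length, because each position across the lists forms one row.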

Set the keep option for duplicates

Create a variable called keep_option and set it to the string 'first'. With this setting, the first occurrence of each repeated row is left unmarked and every later occurrence is marked as a duplicate.

Pandas

import pandas as pd
sales = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple'],
    'Price': [1.0, 0.5, 1.0, 0.5, 1.0],
    'Quantity': [10, 5, 10, 5, 10]
})
# Set keep_option to 'first'
# Your code here

Need a hint?

The keep parameter in duplicated() can be 'first', 'last', or False.
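The three keep values can be compared side by side on the exercise data. The sketch below is only an illustration; the names results, option, and flags are introduced here and are not part of the exercise.

```python
import pandas as pd

sales = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple'],
    'Price': [1.0, 0.5, 1.0, 0.5, 1.0],
    'Quantity': [10, 5, 10, 5, 10]
})

# Collect the flags produced by each keep setting (illustrative names).
results = {}
for option in ('first', 'last', False):
    flags = sales.duplicated(keep=option)
    results[option] = flags.tolist()
    print(repr(option), results[option])

# 'first' -> [False, False, True, True, True]   (first occurrence unmarked)
# 'last'  -> [True, True, True, False, False]   (last occurrence unmarked)
# False   -> [True, True, True, True, True]     (every repeated row marked)
```

Note that False is the boolean value, not the string 'False'.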

Find duplicate rows using duplicated()

Create a variable called duplicates that stores the result of sales.duplicated(keep=keep_option) to find duplicate rows.

Pandas

import pandas as pd
sales = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple'],
    'Price': [1.0, 0.5, 1.0, 0.5, 1.0],
    'Quantity': [10, 5, 10, 5, 10]
})
keep_option = 'first'
# Use duplicated() with keep=keep_option to find duplicates
# Your code here

Need a hint?

Call duplicated() on the DataFrame with the keep argument.
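As a rough sketch of what the call returns: duplicated() gives back a boolean Series with one entry per row, sharing the DataFrame's index. Summing it counts the rows marked as duplicates, since True counts as 1.

```python
import pandas as pd

sales = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple'],
    'Price': [1.0, 0.5, 1.0, 0.5, 1.0],
    'Quantity': [10, 5, 10, 5, 10]
})
keep_option = 'first'

duplicates = sales.duplicated(keep=keep_option)

print(duplicates.dtype)                      # bool
print(duplicates.index.equals(sales.index))  # True: one flag per row
print(int(duplicates.sum()))                 # 3 rows are marked as duplicates
```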

Print the duplicates boolean Series

Write a print statement to display the duplicates variable.

Pandas

import pandas as pd
sales = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple'],
    'Price': [1.0, 0.5, 1.0, 0.5, 1.0],
    'Quantity': [10, 5, 10, 5, 10]
})
keep_option = 'first'
duplicates = sales.duplicated(keep=keep_option)
# Print the duplicates Series
# Your code here

Need a hint?

Use print(duplicates) to see which rows are duplicates.
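Beyond printing the flags, one possible follow-up is to use the boolean Series as a mask to view the duplicated rows themselves. This goes slightly past the four steps of the exercise; the name duplicate_rows is introduced here for illustration only.

```python
import pandas as pd

sales = pd.DataFrame({
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple'],
    'Price': [1.0, 0.5, 1.0, 0.5, 1.0],
    'Quantity': [10, 5, 10, 5, 10]
})
keep_option = 'first'
duplicates = sales.duplicated(keep=keep_option)

# The boolean Series: False for rows 0 and 1, True for rows 2, 3, and 4.
print(duplicates)

# Boolean indexing keeps only the rows where the mask is True.
duplicate_rows = sales[duplicates]
print(duplicate_rows)
```

With keep='first', rows 0 and 1 are the first occurrences of the Apple and Banana records, so only rows 2, 3, and 4 appear in duplicate_rows.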