Removing duplicate rows with drop_duplicates() in pandas - Mini Project: Build & Apply

Remove Duplicate Rows Using drop_duplicates()

📖 Scenario: You work in a small store that keeps track of sales data in a table. Sometimes, the same sale is accidentally recorded twice. You want to clean the data by removing these duplicate sales.

🎯 Goal: Build a small program that creates a sales data table, names the column used to detect duplicates, removes the duplicate rows with drop_duplicates(), and prints the cleaned data.

📋 What You'll Learn

Create a pandas DataFrame called sales_data with the exact columns and rows given in the first step

Create a variable called subset_column to specify which column to check for duplicates

Use drop_duplicates() on sales_data with the subset parameter set to subset_column

Print the cleaned DataFrame

💡 Why This Matters

🌍 Real World

Cleaning duplicate records is a common task in data analysis to ensure accurate results.

💼 Career

Data scientists and analysts often need to clean data by removing duplicates before analysis or reporting.

Create the sales data DataFrame

Create a pandas DataFrame called sales_data with these exact columns and rows:
SaleID: [101, 102, 103, 102, 104]
Product: ['Apple', 'Banana', 'Apple', 'Banana', 'Orange']
Quantity: [5, 3, 5, 3, 2]

Pandas

import pandas as pd
# Create the sales_data DataFrame with the exact data
# Your code here

Hint: Use pd.DataFrame and pass a dictionary whose keys are the column names and whose values are lists of data.
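As a minimal sketch of the dictionary pattern the hint describes, the snippet below builds an unrelated table. The name `inventory` and its values are made up for illustration and are not part of the exercise.

```python
import pandas as pd

# Hypothetical example data (not the exercise's sales table).
inventory = pd.DataFrame({
    'Item': ['Pen', 'Notebook', 'Eraser'],  # each key becomes a column name
    'Stock': [12, 7, 30],                   # each list becomes that column's values
})

print(inventory.shape)          # (number of rows, number of columns)
print(list(inventory.columns))  # column names in insertion order
```

All lists must have the same length, since each position across the lists forms one row.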

Set the column to check for duplicates

Create a variable called subset_column and set it to the string 'SaleID' to specify which column to check for duplicates.

Pandas

import pandas as pd
sales_data = pd.DataFrame({
    'SaleID': [101, 102, 103, 102, 104],
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Orange'],
    'Quantity': [5, 3, 5, 3, 2]
})

# Create subset_column variable with value 'SaleID'
# Your code here

Hint: Just assign the string 'SaleID' to the variable subset_column.
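Before dropping anything, it can help to see which rows pandas would treat as repeats. The sketch below uses DataFrame.duplicated(), which accepts the same subset argument as drop_duplicates(). The names `is_repeat` and `is_repeat_pair` are illustrative choices, not part of the exercise.

```python
import pandas as pd

sales_data = pd.DataFrame({
    'SaleID': [101, 102, 103, 102, 104],
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Orange'],
    'Quantity': [5, 3, 5, 3, 2]
})
subset_column = 'SaleID'

# True marks a row whose SaleID already appeared in an earlier row.
is_repeat = sales_data.duplicated(subset=subset_column)
print(sales_data[is_repeat])  # shows only the rows flagged as repeats

# subset may also be a list of labels; the combination of columns is then compared.
is_repeat_pair = sales_data.duplicated(subset=['SaleID', 'Product'])
print(is_repeat_pair.tolist())
```

With this data, only the second row carrying SaleID 102 is flagged in both checks.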

Remove duplicate rows using drop_duplicates()

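Create a new DataFrame called cleaned_data by calling sales_data.drop_duplicates() with the subset parameter set to subset_column. By default, the first row for each SaleID is kept and later repeats are dropped; sales_data itself is not modified.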

Pandas

import pandas as pd
sales_data = pd.DataFrame({
    'SaleID': [101, 102, 103, 102, 104],
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Orange'],
    'Quantity': [5, 3, 5, 3, 2]
})
subset_column = 'SaleID'

# Use drop_duplicates() on sales_data with subset=subset_column and save to cleaned_data
# Your code here

Hint: Use drop_duplicates() on sales_data and pass subset=subset_column.
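Which of the repeated rows survives is controlled by the keep parameter of drop_duplicates(). The following sketch compares its three accepted values on the exercise data; the variable names `keep_first`, `keep_last`, and `keep_none` are illustrative only.

```python
import pandas as pd

sales_data = pd.DataFrame({
    'SaleID': [101, 102, 103, 102, 104],
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Orange'],
    'Quantity': [5, 3, 5, 3, 2]
})

# keep='first' is the default: the earliest row per SaleID is retained.
keep_first = sales_data.drop_duplicates(subset='SaleID')
# keep='last' retains the latest row per SaleID instead.
keep_last = sales_data.drop_duplicates(subset='SaleID', keep='last')
# keep=False drops every row whose SaleID occurs more than once.
keep_none = sales_data.drop_duplicates(subset='SaleID', keep=False)

print(keep_first.index.tolist())  # [0, 1, 2, 4]
print(keep_last.index.tolist())   # [0, 2, 3, 4]
print(keep_none.index.tolist())   # [0, 2, 4]
```

For this scenario the default is a reasonable fit, because both SaleID 102 rows carry the same product and quantity, so it does not matter which copy is kept.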

Print the cleaned data

Print the cleaned_data DataFrame to see the sales data after removing duplicates.

Pandas

import pandas as pd
sales_data = pd.DataFrame({
    'SaleID': [101, 102, 103, 102, 104],
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Orange'],
    'Quantity': [5, 3, 5, 3, 2]
})
subset_column = 'SaleID'
cleaned_data = sales_data.drop_duplicates(subset=subset_column)

# Print the cleaned_data DataFrame
# Your code here

Hint: Use print(cleaned_data) to show the cleaned table.
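When the cleaned table is printed, the row labels on the left skip a number, because the surviving rows keep their original index labels. The sketch below shows this and one way to renumber the rows, assuming pandas 1.0 or later for the ignore_index parameter; the name `renumbered` is an illustrative choice.

```python
import pandas as pd

sales_data = pd.DataFrame({
    'SaleID': [101, 102, 103, 102, 104],
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Orange'],
    'Quantity': [5, 3, 5, 3, 2]
})

cleaned_data = sales_data.drop_duplicates(subset='SaleID')
print(cleaned_data)
print(cleaned_data.index.tolist())  # [0, 1, 2, 4] (label 3 was the dropped row)

# ignore_index=True relabels the remaining rows 0..n-1 (assumes pandas >= 1.0).
renumbered = sales_data.drop_duplicates(subset='SaleID', ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2, 3]
```

On older pandas versions, calling reset_index(drop=True) on the result gives the same renumbering.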