Pandas · data · ~15 mins

drop_duplicates() for removal in Pandas - Mini Project: Build & Apply

Remove Duplicate Rows Using drop_duplicates()
📖 Scenario: You work at a small store that tracks its sales in a table. Occasionally the same sale is recorded twice by mistake. You want to clean the data by removing these duplicate sales.
🎯 Goal: Build a small program that creates a sales data table, chooses the column used to detect duplicates, removes the duplicate rows with drop_duplicates(), and prints the cleaned data.
📋 What You'll Learn
Create a pandas DataFrame called sales_data with the exact columns and rows listed in the first step
Create a variable called subset_column to specify which column to check for duplicates
Use drop_duplicates() on sales_data with the subset parameter set to subset_column
Print the cleaned DataFrame
💡 Why This Matters
🌍 Real World
Cleaning duplicate records is a common task in data analysis to ensure accurate results.
💼 Career
Data scientists and analysts often need to clean data by removing duplicates before analysis or reporting.
Step 1: Create the sales data DataFrame
Create a pandas DataFrame called sales_data with these exact columns and rows:
SaleID: [101, 102, 103, 102, 104]
Product: ['Apple', 'Banana', 'Apple', 'Banana', 'Orange']
Quantity: [5, 3, 5, 3, 2]
Need a hint?

Use pd.DataFrame and pass a dictionary with keys as column names and values as lists of data.
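One way this step could look is sketched below. It uses only the column names and values listed above; the `pd` alias is the usual convention for importing pandas, not something the exercise requires.

```python
import pandas as pd

# Each dictionary key becomes a column name; each list holds that column's values.
sales_data = pd.DataFrame({
    "SaleID": [101, 102, 103, 102, 104],
    "Product": ["Apple", "Banana", "Apple", "Banana", "Orange"],
    "Quantity": [5, 3, 5, 3, 2],
})

# Five rows and three columns; SaleID 102 appears twice.
print(sales_data.shape)
```

All three lists must have the same length, otherwise pd.DataFrame raises a ValueError.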

Step 2: Set the column to check for duplicates
Create a variable called subset_column and set it to the string 'SaleID' to specify which column to check for duplicates.
Need a hint?

Just assign the string 'SaleID' to the variable subset_column.
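A minimal sketch of this step follows. The assignment itself is all the exercise asks for; the duplicated() call and the `flagged` variable are an optional extra (not part of the exercise) that previews which rows would count as duplicates for that column.

```python
import pandas as pd

sales_data = pd.DataFrame({
    "SaleID": [101, 102, 103, 102, 104],
    "Product": ["Apple", "Banana", "Apple", "Banana", "Orange"],
    "Quantity": [5, 3, 5, 3, 2],
})

# The column whose repeated values mark a row as a duplicate.
subset_column = "SaleID"

# Optional preview (hypothetical helper variable): True marks a row whose
# SaleID already appeared in an earlier row.
flagged = sales_data.duplicated(subset=subset_column)
print(flagged.tolist())  # [False, False, False, True, False]
```

Only the second row with SaleID 102 is flagged. The two Apple rows are not flagged because their SaleID values differ.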

Step 3: Remove duplicate rows using drop_duplicates()
Create a new DataFrame called cleaned_data by calling sales_data.drop_duplicates() with the subset parameter set to subset_column. By default, the first row for each SaleID is kept and later repeats are dropped.
Need a hint?

Use drop_duplicates() on sales_data and pass subset=subset_column.
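The removal step might be written as in the sketch below, which rebuilds the data so it can run on its own. It relies on the default keep='first' behavior of drop_duplicates().

```python
import pandas as pd

sales_data = pd.DataFrame({
    "SaleID": [101, 102, 103, 102, 104],
    "Product": ["Apple", "Banana", "Apple", "Banana", "Orange"],
    "Quantity": [5, 3, 5, 3, 2],
})
subset_column = "SaleID"

# Returns a new DataFrame; sales_data itself is left unchanged.
cleaned_data = sales_data.drop_duplicates(subset=subset_column)

print(len(sales_data))    # 5
print(len(cleaned_data))  # 4
```

Because the result is a new DataFrame, the original table stays available for comparison with the cleaned one.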

Step 4: Print the cleaned data
Print the cleaned_data DataFrame to see the sales data after removing duplicates.
Need a hint?

Use print(cleaned_data) to show the cleaned table.
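Putting the four steps together, the finished program could look roughly like this. It is a sketch of one possible solution rather than the only accepted answer.

```python
import pandas as pd

# Step 1: build the sales table.
sales_data = pd.DataFrame({
    "SaleID": [101, 102, 103, 102, 104],
    "Product": ["Apple", "Banana", "Apple", "Banana", "Orange"],
    "Quantity": [5, 3, 5, 3, 2],
})

# Step 2: choose the column used to detect duplicates.
subset_column = "SaleID"

# Step 3: drop rows whose SaleID repeats an earlier row.
cleaned_data = sales_data.drop_duplicates(subset=subset_column)

# Step 4: show the cleaned table (four rows remain).
print(cleaned_data)
```

The printed table keeps the original index labels 0, 1, 2 and 4, since the row labeled 3 was the dropped duplicate. If consecutive labels are preferred, cleaned_data.reset_index(drop=True) renumbers them.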