0
0
Data Analysis Pythondata~30 mins

Removing duplicates (drop_duplicates) in Data Analysis Python - Mini Project: Build & Apply

Choose your learning style9 modes available
Removing duplicates (drop_duplicates)
📖 Scenario: You work in a small store that keeps track of sales data. Sometimes, the same sale is recorded twice by mistake. You want to clean the data by removing these duplicate sales records.
🎯 Goal: You will create a sales data table, set a rule to identify duplicates, remove the duplicate rows, and then show the cleaned data.
📋 What You'll Learn
Create a pandas DataFrame with sales data including duplicates
Set a variable to specify which columns to check for duplicates
Use drop_duplicates to remove duplicate rows based on the specified columns
Print the cleaned DataFrame
💡 Why This Matters
🌍 Real World
Cleaning duplicate records is a common task in data analysis to ensure accurate reports and decisions.
💼 Career
Data analysts and scientists often need to remove duplicate data entries before analysis to avoid errors.
Progress0 / 4 steps
1
Create the sales data table
Create a pandas DataFrame called sales_data with these exact columns and rows:
SaleID: [101, 102, 103, 102, 104]
Product: ['Apple', 'Banana', 'Apple', 'Banana', 'Orange']
Quantity: [5, 3, 5, 3, 2]
Data Analysis Python
Hint

Use pd.DataFrame with a dictionary where keys are column names and values are lists of column values.

2
Set columns to check for duplicates
Create a variable called check_columns and set it to a list containing the column names 'SaleID' and 'Product'.
Data Analysis Python
Hint

Make a list with the two column names exactly as strings.

3
Remove duplicate sales records
Create a new DataFrame called cleaned_sales by removing duplicate rows from sales_data using drop_duplicates with the subset set to check_columns.
Data Analysis Python
Hint

Use drop_duplicates with the subset parameter set to the list check_columns.

4
Show the cleaned sales data
Print the cleaned_sales DataFrame to display the sales data after removing duplicates.
Data Analysis Python
Hint

Use print(cleaned_sales) to show the DataFrame.