Pandas | data | ~30 mins

Keeping first vs last vs none in Pandas - Hands-On Comparison

📖 Scenario: You work in a store's data team. You have a list of sales records in which some entries for the same product are repeated. You want to clean the data by handling these duplicates in one of three ways: keep the first record of each duplicated group, keep the last record, or drop every record that has a duplicate.

🎯 Goal: Learn how to use pandas drop_duplicates() with keep='first', keep='last', and keep=False options to control which duplicates to keep or remove.

📋 What You'll Learn

Create a pandas DataFrame called sales with given data

Create a variable called subset_cols that lists the columns to check for duplicates

Use drop_duplicates() with keep='first' to keep the first occurrence of each duplicate

Use drop_duplicates() with keep='last' to keep the last occurrence of each duplicate

Use drop_duplicates() with keep=False to remove every row that has a duplicate

Print the resulting DataFrames

💡 Why This Matters

🌍 Real World

Cleaning duplicate sales records is common in retail data analysis to ensure accurate reporting and inventory management.

💼 Career

Data analysts and data scientists often need to remove or handle duplicates in datasets before analysis or modeling.

Create the sales DataFrame

Create a pandas DataFrame called sales with these exact columns and rows:

{'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Cherry', 'Apple'], 'Price': [100, 80, 100, 90, 120, 100], 'Quantity': [5, 7, 5, 8, 10, 5]}

Pandas

import pandas as pd
# Create the sales DataFrame with the given data
# Your code here

Need a hint?

Use pd.DataFrame() with a dictionary of lists for columns.
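As a rough illustration of the dictionary-of-lists pattern, here is a sketch on a small made-up table. The name orders and its values are hypothetical and are not the exercise data.

```python
import pandas as pd

# Hypothetical data: each key becomes a column name,
# and each list holds that column's values, one per row.
orders = pd.DataFrame({
    'Item': ['Pen', 'Ink', 'Pen'],
    'Cost': [2, 5, 2],
})

print(orders.shape)  # (3, 2): three rows, two columns
```

All lists must have the same length, since each position across the lists forms one row.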

Create subset_cols variable for duplicate check

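Create a variable called subset_cols and set it to a list containing the column names 'Product' and 'Price'. Duplicates will be identified by comparing only these two columns. With this data, the three Apple rows match on both columns, while the two Banana rows differ in price and so are not duplicates of each other.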

Pandas

import pandas as pd
sales = pd.DataFrame({'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Cherry', 'Apple'], 'Price': [100, 80, 100, 90, 120, 100], 'Quantity': [5, 7, 5, 8, 10, 5]})
# Create subset_cols list with 'Product' and 'Price'
# Your code here

Need a hint?

Just assign the list ['Product', 'Price'] to subset_cols.
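To see why the choice of columns matters, the sketch below uses duplicated(), which returns a True/False mask marking repeated rows, on a made-up table. The orders DataFrame and its values are hypothetical and are not the exercise data.

```python
import pandas as pd

# Hypothetical data, not the exercise data.
orders = pd.DataFrame({
    'Item': ['Pen', 'Pen', 'Ink'],
    'Cost': [2, 2, 5],
    'Units': [1, 3, 4],
})

# Comparing every column: no two rows match, because Units differs.
all_cols_mask = orders.duplicated()

# Comparing only Item and Cost: the second 'Pen' row repeats the first.
subset_mask = orders.duplicated(subset=['Item', 'Cost'])

print(all_cols_mask.tolist())  # [False, False, False]
print(subset_mask.tolist())    # [False, True, False]
```

The same subset argument is accepted by drop_duplicates(), so a list of column names can be stored once and reused.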

Remove duplicates keeping the first occurrence

Create a new DataFrame called keep_first by calling sales.drop_duplicates() with subset=subset_cols and keep='first', so that the first row of each duplicate group is kept and the later repeats are dropped.

Pandas

import pandas as pd
sales = pd.DataFrame({'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Cherry', 'Apple'], 'Price': [100, 80, 100, 90, 120, 100], 'Quantity': [5, 7, 5, 8, 10, 5]})
subset_cols = ['Product', 'Price']
# Create keep_first DataFrame by dropping duplicates keeping first
# Your code here

Need a hint?

Use drop_duplicates() with subset=subset_cols and keep='first'.
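The sketch below shows the effect of keep='first' on a made-up table; the orders DataFrame, the first_only name, and the values are hypothetical and are not the exercise data. It also illustrates two details that are easy to miss: the surviving rows keep their original index labels, and the source DataFrame is left unchanged.

```python
import pandas as pd

# Hypothetical data, not the exercise data.
orders = pd.DataFrame({
    'Item': ['Pen', 'Ink', 'Pen'],
    'Cost': [2, 5, 2],
    'Units': [1, 4, 3],
})

# Rows 0 and 2 match on Item and Cost; keep='first' retains row 0.
first_only = orders.drop_duplicates(subset=['Item', 'Cost'], keep='first')

print(first_only.index.tolist())     # [0, 1]
print(first_only['Units'].tolist())  # [1, 4]
print(len(orders))                   # 3, the original is not modified
```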

Remove duplicates keeping the last occurrence, then remove all duplicated rows

Create two new DataFrames:
1. keep_last by dropping duplicates with subset=subset_cols and keep='last', which keeps the last row of each duplicate group.
2. keep_none by dropping duplicates with subset=subset_cols and keep=False, which removes every row that has a duplicate.
Then print keep_first, keep_last, and keep_none.

Pandas

import pandas as pd
sales = pd.DataFrame({'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Cherry', 'Apple'], 'Price': [100, 80, 100, 90, 120, 100], 'Quantity': [5, 7, 5, 8, 10, 5]})
subset_cols = ['Product', 'Price']
keep_first = sales.drop_duplicates(subset=subset_cols, keep='first')
# Create keep_last DataFrame by dropping duplicates keeping last
# Create keep_none DataFrame by dropping duplicates keeping none
# Print keep_first, keep_last, and keep_none
# Your code here

Need a hint?

Use drop_duplicates() with keep='last' and keep=False. Then print all three DataFrames.
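For a side-by-side view of the three options, here is a sketch on a made-up table. The orders DataFrame, the kept_* names, and the values are hypothetical and are not the exercise data. Comparing the surviving index labels makes the difference between the options easy to see.

```python
import pandas as pd

# Hypothetical data, not the exercise data.
# Rows 0 and 2 are duplicates on Item and Cost; rows 1 and 3 are unique.
orders = pd.DataFrame({
    'Item': ['Pen', 'Ink', 'Pen', 'Cap'],
    'Cost': [2, 5, 2, 1],
})
cols = ['Item', 'Cost']

kept_first = orders.drop_duplicates(subset=cols, keep='first')
kept_last = orders.drop_duplicates(subset=cols, keep='last')
kept_none = orders.drop_duplicates(subset=cols, keep=False)

print(kept_first.index.tolist())  # [0, 1, 3]
print(kept_last.index.tolist())   # [1, 2, 3]
print(kept_none.index.tolist())   # [1, 3]
```

Note that keep=False takes the Boolean False, not the string 'False', and it is the only option that can leave a product with no rows at all.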