Removing duplicates (drop_duplicates) in Data Analysis Python - Step-by-Step Execution

Concept Flow - Removing duplicates (drop_duplicates)
Start with DataFrame
Check each row for duplicates
Mark duplicates as True/False
Keep first occurrence, remove others
Return DataFrame without duplicates
End
The process checks each row in the data, marks duplicates, keeps the first occurrence, and removes the rest to return a clean DataFrame.
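The flow above can be sketched with `DataFrame.duplicated`, which returns the True/False marks described in the middle steps. This is a minimal sketch on the same sample data used in this walkthrough; the names `marks` and `kept` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Cody'],
    'Age': [25, 30, 25, 22]
})

# Mark each row: True if an identical row appeared earlier, False otherwise
marks = df.duplicated()

# Keep only the rows that are not marked as duplicates
kept = df[~marks]

print(marks.tolist())                     # [False, False, True, False]
print(kept.equals(df.drop_duplicates()))  # True
```

Filtering on the inverted marks gives the same rows as calling drop_duplicates directly, which is why the two steps are usually collapsed into one call.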
Execution Sample
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Cody'],
    'Age': [25, 30, 25, 22]
})

clean_df = df.drop_duplicates()
This code creates a DataFrame in which one row is repeated, then removes the repeat with drop_duplicates.
Execution Table
Step | Row Index | Row Data | Is Duplicate? | Action | Resulting DataFrame Rows
1 | 0 | {'Name': 'Anna', 'Age': 25} | No | Keep | [0]
2 | 1 | {'Name': 'Bob', 'Age': 30} | No | Keep | [0, 1]
3 | 2 | {'Name': 'Anna', 'Age': 25} | Yes | Remove | [0, 1]
4 | 3 | {'Name': 'Cody', 'Age': 22} | No | Keep | [0, 1, 3]
5 | - | - | - | - | Final DataFrame has rows with indices [0, 1, 3]
💡 All rows checked; duplicates removed except first occurrences.
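To check the last row of the table, the snippet below prints the index labels that survive. It also shows one way to renumber them with `reset_index(drop=True)`; that step is an addition for illustration and is not part of the original sample.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Cody'],
    'Age': [25, 30, 25, 22]
})

clean_df = df.drop_duplicates()

# The surviving rows keep their original index labels
print(clean_df.index.tolist())    # [0, 1, 3]

# Optional: renumber the rows from 0 after removing duplicates
renumbered = clean_df.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

The gap at label 2 is expected: drop_duplicates removes rows but does not relabel the ones that remain.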
Variable Tracker
Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | Final
df | [4 rows] | [4 rows] | [4 rows] | [4 rows] | [4 rows] | [4 rows]
clean_df | N/A | N/A | N/A | N/A | N/A | [3 rows, duplicates removed]
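The tracker can be confirmed with a quick check that `df` is left untouched while `clean_df` is a separate, shorter DataFrame. This is a small sketch that relies on the default behaviour of drop_duplicates (no `inplace=True`).

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Cody'],
    'Age': [25, 30, 25, 22]
})

clean_df = df.drop_duplicates()

print(len(df))         # 4, the original still has every row
print(len(clean_df))   # 3, the repeated row is gone
print(clean_df is df)  # False, a new object was returned
```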
Key Moments - 2 Insights
Why does drop_duplicates keep the first occurrence and remove later ones?
By default, drop_duplicates uses keep='first': the first occurrence of each row is kept and any later repeat is removed, so the surviving rows stay in their original order. In the execution table, step 1 keeps the first 'Anna' row and step 3 removes its repeat.
Does drop_duplicates remove rows with any difference or only exact duplicates?
By default it removes only exact duplicates, where every column value matches. In the execution table, the 'Anna' rows at index 0 and index 2 are identical, so the one at index 2 is removed.
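A short sketch of the exact-match rule, using hypothetical data in which the second 'Anna' row has a different age. The `people` DataFrame and the `by_name` variable are made up for illustration; `subset` is the drop_duplicates parameter that restricts the comparison to chosen columns.

```python
import pandas as pd

# Hypothetical data: the third row shares a Name with the first but not an Age
people = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Cody'],
    'Age': [25, 30, 26, 22]
})

# No row matches another in every column, so nothing is removed
print(len(people.drop_duplicates()))  # 4

# Compare on 'Name' only: the second 'Anna' row now counts as a duplicate
by_name = people.drop_duplicates(subset=['Name'])
print(by_name.index.tolist())         # [0, 1, 3]
```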
Visual Quiz - 3 Questions
Test your understanding
Look at the execution table, which row is identified as a duplicate and removed?
A. Row with index 1
B. Row with index 3
C. Row with index 2
D. Row with index 0
💡 Hint
Check the 'Is Duplicate?' and 'Action' columns at step 3 of the execution table.
At which step does the DataFrame first lose a row due to duplicate removal?
A. Step 2
B. Step 3
C. Step 4
D. Step 5
💡 Hint
Compare the 'Resulting DataFrame Rows' column at steps 2 and 3 of the execution table.
If drop_duplicates were called with keep='last', which row would be removed instead?
A. Row with index 0
B. Row with index 1
C. Row with index 2
D. Row with index 3
💡 Hint
With keep='last', drop_duplicates keeps the last occurrence of each repeated row, so the earlier occurrence is the one removed.
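The effect of the `keep` argument can be checked directly on the sample data. The sketch below compares the surviving index labels for the three values that `keep` accepts; `keep=False`, which drops every copy of a repeated row, goes beyond what the quiz asks and is included only for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Cody'],
    'Age': [25, 30, 25, 22]
})

# Default: keep the first 'Anna' row, drop the later one
print(df.drop_duplicates(keep='first').index.tolist())  # [0, 1, 3]

# Keep the last 'Anna' row, drop the earlier one
print(df.drop_duplicates(keep='last').index.tolist())   # [1, 2, 3]

# Drop every row that has a duplicate anywhere in the DataFrame
print(df.drop_duplicates(keep=False).index.tolist())    # [1, 3]
```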
Concept Snapshot
drop_duplicates removes duplicate rows from a DataFrame.
By default, it keeps the first occurrence and removes later duplicates.
Duplicates are rows where all column values match exactly.
Use keep='last' to keep the last occurrence instead.
Returns a new DataFrame without duplicates; by default the original DataFrame is not modified.
Full Transcript
This visual execution shows how pandas drop_duplicates works step-by-step. Starting with a DataFrame of four rows, it checks each row for duplicates. The first 'Anna' row is kept. The second 'Anna' row is detected as a duplicate and removed. Other unique rows are kept. The final DataFrame has three rows with duplicates removed. This helps clean data by removing repeated entries while keeping the original order.