Data Analysis Pythondata~10 mins

Removing duplicates (drop_duplicates) in Data Analysis Python - Step-by-Step Execution

Choose your learning style9 modes available

Learn Why Deep Visual Try Challenge Project Recall Time

Concept Flow - Removing duplicates (drop_duplicates)

Start with DataFrame

↓

Check each row for duplicates

↓

Mark duplicates as True/False

↓

Keep first occurrence, remove others

↓

Return DataFrame without duplicates

↓

End

The process checks each row in the data, marks duplicates, keeps the first occurrence, and removes the rest to return a clean DataFrame.

Execution Sample

Data Analysis Python

import pandas as pd

df = pd.DataFrame({
    'Name': ['Anna', 'Bob', 'Anna', 'Cody'],
    'Age': [25, 30, 25, 22]
})

clean_df = df.drop_duplicates()

This code creates a DataFrame with duplicate rows and removes duplicates using drop_duplicates.

Execution Table

Step	Row Index	Row Data	Is Duplicate?	Action	Resulting DataFrame Rows
1	0	{'Name': 'Anna', 'Age': 25}	No	Keep	[0]
2	1	{'Name': 'Bob', 'Age': 30}	No	Keep	[0, 1]
3	2	{'Name': 'Anna', 'Age': 25}	Yes	Remove	[0, 1]
4	3	{'Name': 'Cody', 'Age': 22}	No	Keep	[0, 1, 3]
5	-	-	-	-	Final DataFrame has rows with indices [0, 1, 3]

💡 All rows checked; duplicates removed except first occurrences.

Variable Tracker

Variable	Start	After Step 1	After Step 2	After Step 3	After Step 4	Final
df	[4 rows]	[4 rows]	[4 rows]	[4 rows]	[4 rows]	[4 rows]
clean_df	N/A	N/A	N/A	[3 rows, duplicates removed]	[3 rows]	[3 rows]

Key Moments - 2 Insights

Why does drop_duplicates keep the first occurrence and remove later ones?

Does drop_duplicates remove rows with any difference or only exact duplicates?

Visual Quiz - 3 Questions

Test your understanding

Look at the execution table, which row is identified as a duplicate and removed?

ARow with index 1

BRow with index 3

CRow with index 2

DRow with index 0

Concept Snapshot

drop_duplicates removes duplicate rows from a DataFrame.
By default, it keeps the first occurrence and removes later duplicates.
Duplicates are rows where all column values match exactly.
Use keep='last' to keep the last occurrence instead.
Returns a new DataFrame without duplicates.

Full Transcript

This visual execution shows how pandas drop_duplicates works step-by-step. Starting with a DataFrame of four rows, it checks each row for duplicates. The first 'Anna' row is kept. The second 'Anna' row is detected as a duplicate and removed. Other unique rows are kept. The final DataFrame has three rows with duplicates removed. This helps clean data by removing repeated entries while keeping the original order.