
Duplicates on specific columns in Pandas - Step-by-Step Execution

Concept Flow - Duplicates on specific columns

Start with DataFrame

↓

Select specific columns

↓

Check duplicates in these columns

↓

Mark duplicates True/False

↓

Use result to filter or analyze

↓

End

We start with a DataFrame, select columns to check, find duplicates there, mark them, and then use this info to filter or analyze.

Execution Sample

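import pandas as pd

# Small example: rows 1 and 2 share the same values in 'A' and 'B'
df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['x', 'y', 'y', 'z'],
    'C': [10, 20, 20, 30]
})

# Boolean Series: True for rows whose ('A', 'B') pair appeared in an earlier row
duplicates = df.duplicated(subset=['A', 'B'])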

This code creates a DataFrame and flags duplicates based on columns 'A' and 'B' only. The result is a Boolean Series with one entry per row.
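To make the "filter or analyze" step concrete, here is a minimal sketch that uses the Boolean result as a mask. The names `repeated_rows` and `unique_rows` are illustrative and do not come from the original example.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['x', 'y', 'y', 'z'],
    'C': [10, 20, 20, 30]
})

# True where the ('A', 'B') pair already appeared in an earlier row
duplicates = df.duplicated(subset=['A', 'B'])

# Boolean indexing: keep only the flagged rows, or everything except them
repeated_rows = df[duplicates]
unique_rows = df[~duplicates]

print(duplicates.tolist())
print(repeated_rows)
print(unique_rows)
```

Because the mask shares the DataFrame's index, it can be passed directly to `df[...]` without any realignment.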

Execution Table

Index	Row Data (A,B,C)	Check Duplicates on ['A','B']	Is Duplicate?	Action
0	(1, x, 10)	First occurrence of (1,x)	False	Keep
1	(2, y, 20)	First occurrence of (2,y)	False	Keep
2	(2, y, 20)	Duplicate of index 1 (2,y)	True	Mark duplicate
3	(3, z, 30)	First occurrence of (3,z)	False	Keep

💡 All rows are checked; a row is marked True only if the same values in columns 'A' and 'B' appeared in an earlier row.
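The "appeared before" rule reflects the default `keep='first'` setting of `duplicated()`. The sketch below compares it with the other two accepted values, `keep='last'` and `keep=False`. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['x', 'y', 'y', 'z'],
    'C': [10, 20, 20, 30]
})

# Default: the first occurrence is False, later repeats are True
mark_later = df.duplicated(subset=['A', 'B'], keep='first')

# The last occurrence is False, earlier repeats are True
mark_earlier = df.duplicated(subset=['A', 'B'], keep='last')

# Every row that belongs to a repeated group is True
mark_all = df.duplicated(subset=['A', 'B'], keep=False)

print(mark_later.tolist())
print(mark_earlier.tolist())
print(mark_all.tolist())
```

`keep=False` is useful when you want to inspect all rows involved in a duplicate group rather than only the repeats.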

Variable Tracker

Variable	Start	After Index 0	After Index 1	After Index 2	After Index 3	Final
duplicates	empty	False	False	True	False	[False, False, True, False]
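The row-by-row build-up in the tracker can be mimicked in plain Python. This is only an illustration of the logic, not how pandas implements `duplicated()` internally. The helper name `mark_duplicates` is hypothetical.

```python
def mark_duplicates(rows, key_positions):
    """Flag each row whose key columns were already seen in an earlier row."""
    seen = set()
    flags = []
    for row in rows:
        # Build the comparison key from the selected column positions only
        key = tuple(row[i] for i in key_positions)
        flags.append(key in seen)   # True only if this key appeared before
        seen.add(key)
    return flags


# Same data as the example, as (A, B, C) tuples
rows = [(1, 'x', 10), (2, 'y', 20), (2, 'y', 20), (3, 'z', 30)]

# Positions 0 and 1 correspond to columns 'A' and 'B'
flags = mark_duplicates(rows, key_positions=(0, 1))
print(flags)
```

The `seen` set plays the role of "combinations encountered so far", which is why the first occurrence of each combination is always False.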

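Key Moments - 2 Insights

Why is the third row marked as a duplicate even though column 'C' is the same as the second row?

Because subset=['A', 'B'] restricts the comparison to columns 'A' and 'B'. The third row repeats the pair (2, y) from the second row, so it is marked True. Column 'C' is ignored, whether or not its values match.

What happens if we do not specify the subset parameter?

duplicated() then compares all columns, so a row is marked True only if every value matches an earlier row. In this example the result is unchanged, because the third row also matches the second row in column 'C'.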
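The effect of omitting `subset` is easier to see when column 'C' differs between the repeated rows. The sketch below uses a hypothetical variant of the example data, `df_variant`, in which the third row has C = 25 instead of 20.

```python
import pandas as pd

# Hypothetical variant of the example: the third row differs in column 'C'
df_variant = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['x', 'y', 'y', 'z'],
    'C': [10, 20, 25, 30]
})

# No subset: all columns are compared, so no row fully repeats an earlier one
on_all_columns = df_variant.duplicated()

# subset given: only 'A' and 'B' are compared, so the third row is flagged
on_a_and_b = df_variant.duplicated(subset=['A', 'B'])

print(on_all_columns.tolist())
print(on_a_and_b.tolist())
```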

Visual Quiz - 3 Questions

Test your understanding

Look at the Execution Table. What is the value of duplicates at index 2?

A. False

B. True

C. None

D. Error

Concept Snapshot

Duplicates on specific columns:
- Use df.duplicated(subset=['col1', 'col2'])
- Checks for duplicates only in the chosen columns
- Returns a Boolean Series marking duplicates
- First occurrence is False, repeats are True
- Useful for filtering or cleaning data by columns
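For the cleaning use case, a minimal sketch: `drop_duplicates()` accepts the same `subset` argument and removes the flagged rows in one step, while summing the Boolean Series counts them. The names `deduped` and `n_repeats` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['x', 'y', 'y', 'z'],
    'C': [10, 20, 20, 30]
})

# Count the rows flagged as repeats (True counts as 1 when summed)
n_repeats = int(df.duplicated(subset=['A', 'B']).sum())

# Remove the repeats, keeping the first occurrence of each ('A', 'B') pair
deduped = df.drop_duplicates(subset=['A', 'B'])

print(n_repeats)
print(deduped)
```

With the default settings, this gives the same rows as filtering with `df[~df.duplicated(subset=['A', 'B'])]`.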

Full Transcript

This visual execution shows how to find duplicates in a pandas DataFrame based on specific columns. We start with a DataFrame of four rows and three columns and call the duplicated() method with subset=['A','B'], so that only columns A and B are compared. The method returns a Boolean Series with one value per row: the first occurrence of each combination is marked False, and any later row that repeats the same combination is marked True. For example, the third row has the same values in A and B as the second row, so it is marked True. Column C is ignored in this check, which lets us identify duplicates based on the selected columns only. The variable tracker shows how the duplicates result builds up row by row. The key moments clarify common confusions about which columns affect the result and what happens if subset is omitted. The quiz questions test understanding of how duplicates are marked and how changing the subset columns affects the result.