
Duplicates on specific columns in Pandas - Step-by-Step Execution

Concept Flow - Duplicates on specific columns
Start with DataFrame
Select specific columns
Check duplicates in these columns
Mark duplicates True/False
Use result to filter or analyze
End
We start with a DataFrame, select columns to check, find duplicates there, mark them, and then use this info to filter or analyze.
Execution Sample
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['x', 'y', 'y', 'z'],
    'C': [10, 20, 20, 30]
})

duplicates = df.duplicated(subset=['A', 'B'])
This code creates a DataFrame and finds duplicates based on columns 'A' and 'B'.
Execution Table
Index | Row Data (A, B, C) | Check Duplicates on ['A', 'B'] | Is Duplicate? | Action
0 | (1, x, 10) | First occurrence of (1, x) | False | Keep
1 | (2, y, 20) | First occurrence of (2, y) | False | Keep
2 | (2, y, 20) | Duplicate of index 1 (2, y) | True | Mark duplicate
3 | (3, z, 30) | First occurrence of (3, z) | False | Keep
💡 All rows checked; duplicates marked True only if same values in columns 'A' and 'B' appeared before.
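The last step of the flow, using the result to filter, can be sketched as follows. This is a minimal illustration on the same sample DataFrame; the names dup_rows, unique_rows, and deduped are chosen here for the example and do not appear in the original sample.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['x', 'y', 'y', 'z'],
    'C': [10, 20, 20, 30]
})

# Boolean Series: True where the (A, B) pair already appeared in an earlier row
duplicates = df.duplicated(subset=['A', 'B'])
print(duplicates.tolist())  # [False, False, True, False]

# Rows flagged as duplicates (only index 2 here)
dup_rows = df[duplicates]

# Rows that remain after dropping the flagged ones (indexes 0, 1, 3)
unique_rows = df[~duplicates]

# drop_duplicates with the same subset selects the same rows as the mask above
deduped = df.drop_duplicates(subset=['A', 'B'])
print(unique_rows.equals(deduped))  # True
```

Filtering with the mask keeps the original index labels, which makes it easy to trace each remaining row back to the execution table.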
Variable Tracker
Variable | Start | After Index 0 | After Index 1 | After Index 2 | After Index 3 | Final
duplicates | empty | False | False | True | False | [False, False, True, False]
Key Moments - 2 Insights
Why is the third row marked as a duplicate, and does the matching value in column 'C' matter?
Duplicates are checked only on columns 'A' and 'B', as set by subset=['A','B']. Column 'C' happens to match as well, but it plays no part in the check (see the row for index 2 in execution_table).
What happens if we do not specify the subset parameter?
If subset is not specified, duplicates are checked on all columns, so rows must match exactly in all columns to be marked duplicate.
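In the sample data the two approaches happen to give the same result, because rows 1 and 2 also match in column 'C'. To make the difference visible, the sketch below uses a hypothetical variant of the data (df_variant, not part of the original sample) in which row 2 has a different 'C' value.

```python
import pandas as pd

# Hypothetical variant of the sample data: row 2 has C = 25 instead of 20
df_variant = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['x', 'y', 'y', 'z'],
    'C': [10, 20, 25, 30]
})

# Compare only columns A and B: row 2 repeats (2, y)
on_subset = df_variant.duplicated(subset=['A', 'B'])
print(on_subset.tolist())  # [False, False, True, False]

# No subset: all columns are compared, and no row matches another exactly
on_all = df_variant.duplicated()
print(on_all.tolist())  # [False, False, False, False]
```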
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table, what is the value of duplicates at index 2?
A. False
B. True
C. None
D. Error
💡 Hint
Check the 'Is Duplicate?' column for index 2 in execution_table.
At which index does the duplicate condition become True for the first time?
A. Index 2
B. Index 1
C. Index 0
D. Index 3
💡 Hint
Look at the 'Is Duplicate?' column in execution_table to find the first True.
If we add column 'C' to the subset parameter, how would the duplicates array change?
A. More duplicates would be marked True
B. Duplicates would remain the same
C. No duplicates would be marked True
D. Code would error
💡 Hint
Check execution_table: rows 1 and 2 have the same values in columns 'A' and 'B', and also in 'C'. Adding 'C' means rows must match in all three columns, which they still do, so index 2 would still be marked True.
Concept Snapshot
Duplicates on specific columns:
- Use df.duplicated(subset=[col1, col2])
- Checks duplicates only on chosen columns
- Returns Boolean Series marking duplicates
- First occurrence is False, repeats are True (with the default keep='first')
- Useful to filter or clean data by columns
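The walkthrough above relies on the default behavior, where the first occurrence is treated as the original. As a small sketch that goes slightly beyond the sample, duplicated() also accepts a keep argument that changes which occurrence is left unmarked; the variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3],
    'B': ['x', 'y', 'y', 'z'],
    'C': [10, 20, 20, 30]
})

# Default (keep='first'): the first (2, y) row is the original, later ones are flagged
mark_later = df.duplicated(subset=['A', 'B'])
print(mark_later.tolist())  # [False, False, True, False]

# keep='last': the last (2, y) row is the original, earlier ones are flagged
mark_earlier = df.duplicated(subset=['A', 'B'], keep='last')
print(mark_earlier.tolist())  # [False, True, False, False]

# keep=False: every row that belongs to a repeated (A, B) group is flagged
mark_all = df.duplicated(subset=['A', 'B'], keep=False)
print(mark_all.tolist())  # [False, True, True, False]
```

Using keep=False is handy when the goal is to inspect all members of a duplicated group rather than to drop the extras.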
Full Transcript
This visual execution shows how to find duplicates in a pandas DataFrame based on specific columns. We start with a DataFrame of four rows and three columns. We use the duplicated() method with subset=['A','B'] to check duplicates only on columns A and B. The method returns a Boolean Series that marks True for rows repeating a combination of values already seen in these columns. The first occurrence of each combination is marked False. For example, the third row has the same values in A and B as the second row, so it is marked True as a duplicate. Column C is ignored in this check. This helps us identify duplicates based on selected columns only. The variable tracker shows how the duplicates Series builds up step by step. The key moments clarify common confusions about which columns affect duplicates and what happens if subset is omitted. The quiz questions test understanding of the duplicate marking and the effects of changing the subset columns.