
Counting duplicates in Pandas - Step-by-Step Execution

Concept Flow - Counting duplicates
Start with DataFrame
Identify duplicate rows
Count duplicates per row or overall
Output counts or filtered DataFrame
End
We start with a DataFrame, find which rows are duplicates, count them, and then output the counts or filtered data.
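The flow above can be sketched roughly as follows. This is a minimal illustration using the same sample data as the rest of this page; the variable names (`dup_mask`, `total_flagged`, `occurrences`, `dup_rows`) are chosen here for illustration, and `groupby(...).size()` is only one possible way to get per-row counts.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3], 'B': [5, 6, 6, 7, 7, 7]})

# Identify: flag every row that has at least one exact copy elsewhere.
dup_mask = df.duplicated(keep=False)

# Count overall: number of flagged rows.
total_flagged = int(dup_mask.sum())  # 5

# Count per distinct row: how often each (A, B) combination occurs.
occurrences = df.groupby(['A', 'B']).size()

# Output: keep only the rows that belong to a duplicate group.
dup_rows = df[dup_mask]

print(total_flagged)
print(occurrences)
print(dup_rows)
```

Here `occurrences` holds 1 for (1, 5), 2 for (2, 6), and 3 for (3, 7), while `dup_rows` contains the five flagged rows.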
Execution Sample
Pandas
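import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3], 'B': [5, 6, 6, 7, 7, 7]})
# keep=False flags every row that has an exact copy elsewhere
dup_mask = df.duplicated(keep=False)
# True counts as 1, so the sum is the number of flagged rows
dup_counts = dup_mask.sum()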
This code creates a DataFrame, flags every row that has an exact copy elsewhere, and sums the flags to count those rows (5 of the 6 rows here).
Execution Table
Step | Action | DataFrame State | Duplicates Identified | Count Result
1 | Create DataFrame | [{'A':1,'B':5},{'A':2,'B':6},{'A':2,'B':6},{'A':3,'B':7},{'A':3,'B':7},{'A':3,'B':7}] | None yet | None yet
2 | Check duplicates with keep=False | Same as step 1 | [False, True, True, True, True, True] | None yet
3 | Sum True values for duplicates | Same as step 1 | [False, True, True, True, True, True] | 5
4 | Output total duplicate count | Same as step 1 | [False, True, True, True, True, True] | 5
💡 All rows checked; total duplicates counted as 5
Variable Tracker
Variable | Start | After Step 2 | After Step 3 | Final
df | Empty | [{'A':1,'B':5},{'A':2,'B':6},{'A':2,'B':6},{'A':3,'B':7},{'A':3,'B':7},{'A':3,'B':7}] | Same | Same
dup_mask | None | [False, True, True, True, True, True] | Same | Same
dup_counts | None | None | 5 | 5
Key Moments - 3 Insights
Why does duplicated(keep=False) mark all duplicates as True, not just some?
Using keep=False marks every occurrence of a duplicate row as True, not just the later ones. See execution_table step 2 where all duplicates are True.
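To make the difference concrete, here is a small sketch that compares the three `keep` settings on the same sample data. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3], 'B': [5, 6, 6, 7, 7, 7]})

# keep='first' (the default): the first occurrence of each group stays False.
first_mask = df.duplicated(keep='first').tolist()
# [False, False, True, False, True, True]

# keep='last': the last occurrence of each group stays False.
last_mask = df.duplicated(keep='last').tolist()
# [False, True, False, True, True, False]

# keep=False: every occurrence in a duplicate group is True.
all_mask = df.duplicated(keep=False).tolist()
# [False, True, True, True, True, True]

print(first_mask, last_mask, all_mask, sep='\n')
```

With `keep='first'` or `keep='last'` the sum is 3 (the extra copies only); with `keep=False` it is 5 (every row that takes part in a duplicate group).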
Why is the first row marked False even if there are duplicates?
The first row is unique in this example, so it's False. Only rows with exact duplicates get True. Check execution_table step 2 for the mask.
How does sum() count duplicates from the boolean mask?
True counts as 1 and False as 0, so sum() adds all True values to get total duplicates. See execution_table step 3.
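As a brief illustration of that arithmetic, the sketch below sums the mask, sums its negation, and takes its mean. The names `n_flagged`, `n_unique`, and `share_flagged` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3], 'B': [5, 6, 6, 7, 7, 7]})
mask = df.duplicated(keep=False)

# Each True contributes 1 and each False contributes 0.
n_flagged = int(mask.sum())      # 5

# Negating the mask counts the rows that have no copy.
n_unique = int((~mask).sum())    # 1

# The mean of a boolean mask is the fraction of True values (5 of 6 rows).
share_flagged = float(mask.mean())

print(n_flagged, n_unique, share_flagged)
```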
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table at step 2, what is the duplicate mask for the third row?
A. False
B. True
C. None
D. Error
💡 Hint
Check the 'Duplicates Identified' column at step 2 for the third row.
At which step does the code calculate the total number of duplicate rows?
A. Step 3
B. Step 2
C. Step 1
D. Step 4
💡 Hint
Look for the step where 'Count Result' shows a number.
If we change keep=False to keep='first' in duplicated(), how would the duplicate mask change at step 2?
A. All duplicates marked True
B. Only first occurrence True
C. Only later duplicates True
D. No duplicates marked True
💡 Hint
Remember that keep='first' marks every occurrence of a duplicate except the first as True.
Concept Snapshot
Counting duplicates in pandas:
- Use df.duplicated(keep=False) to mark all duplicates True
- Sum the boolean mask to count duplicates
- keep='first' (the default) or keep='last' leaves one occurrence of each duplicate group unmarked
- Useful to find repeated rows in data
- Returns boolean Series for filtering or counting
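The last point, using the mask for filtering, might look like the following sketch. It assumes the same sample data; `drop_duplicates()` is shown only as a related pandas method for comparison and is not part of the trace above.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3], 'B': [5, 6, 6, 7, 7, 7]})
mask = df.duplicated(keep=False)

# Rows that belong to a duplicate group.
repeated = df[mask]

# Rows that occur exactly once.
unique_only = df[~mask]

# One representative per distinct row (the first occurrence is kept by default).
deduped = df.drop_duplicates()

print(len(repeated), len(unique_only), len(deduped))  # 5 1 3
```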
Full Transcript
We start with a DataFrame containing some repeated rows. Calling duplicated() with keep=False returns a boolean mask that is True for every row with an exact copy elsewhere and False for unique rows. Summing the mask counts the flagged rows, 5 of 6 in this example. Changing the keep parameter changes which occurrences are marked: keep='first' or keep='last' leaves one occurrence of each group unmarked. This step-by-step trace shows how pandas arrives at the count.