
Duplicates on specific columns in Pandas - Deep Dive

Overview - Duplicates on specific columns
What is it?
Duplicates on specific columns means finding rows in a table where certain columns have the same values. Instead of checking the whole row, we focus only on some columns to see if any values repeat. This helps us spot repeated information based on important parts of the data. It is useful when some columns define the uniqueness of data entries.
Why it matters
Without checking duplicates on specific columns, we might miss repeated data that causes errors or wrong analysis. For example, in a customer list, two entries might have the same email address but differ in other details. Catching duplicates on key columns helps clean data, avoid mistakes, and make better decisions. It saves time and improves trust in data results.
Where it fits
Before this, you should know how to work with pandas DataFrames and basic filtering. After learning this, you can explore data cleaning techniques like handling missing values or merging datasets. This topic fits into the data cleaning and preprocessing stage of data science.
Mental Model
Core Idea
Checking duplicates on specific columns means looking only at chosen parts of each row to find repeated values, ignoring the rest.
Think of it like...
It's like checking if two people have the same phone number in a contact list, without caring if their addresses or names differ.
DataFrame rows:
┌─────────┬───────────┬───────────┬───────────┐
│ Row #   │ Column A  │ Column B  │ Column C  │
├─────────┼───────────┼───────────┼───────────┤
│ 0       │ apple     │ red       │ 10        │
│ 1       │ banana    │ yellow    │ 5         │
│ 2       │ apple     │ red       │ 7         │
│ 3       │ apple     │ green     │ 10        │
└─────────┴───────────┴───────────┴───────────┘

Checking duplicates on Column A and Column B finds rows 0 and 2 as duplicates because both have 'apple' and 'red'.
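The table above translates directly into a short pandas check (the column names are taken from the diagram):

```python
import pandas as pd

# Recreate the example table from the mental model
df = pd.DataFrame({
    "Column A": ["apple", "banana", "apple", "apple"],
    "Column B": ["red", "yellow", "red", "green"],
    "Column C": [10, 5, 7, 10],
})

# Mark every row whose (Column A, Column B) pair appears more than once
dupes = df.duplicated(subset=["Column A", "Column B"], keep=False)
print(dupes.tolist())  # [True, False, True, False] -> rows 0 and 2
```

With keep=False both members of a duplicate pair are flagged, matching the description above; the default keep='first' would flag only row 2.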
Build-Up - 7 Steps
1
Foundation: Understanding DataFrame basics
Concept: Learn what a pandas DataFrame is and how data is organized in rows and columns.
A DataFrame is like a table with rows and columns. Each column has a name and holds data of one type. Rows represent records or entries. You can access columns by their names and rows by their positions or labels.
Result
You can create, view, and select parts of a DataFrame.
Knowing the structure of DataFrames is essential because duplicates are found by comparing rows based on columns.
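A minimal sketch of those basics (the names and values here are invented for illustration):

```python
import pandas as pd

# A DataFrame is a table: named columns, positional or labeled rows
df = pd.DataFrame({"name": ["Ana", "Ben"], "age": [30, 25]})

print(df.shape)        # (2, 2): two rows, two columns
print(df["name"])      # select a column by its name
print(df.iloc[0])      # select a row by its position
```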
2
Foundation: What are duplicates in data?
Concept: Understand the meaning of duplicate rows and why they matter.
Duplicates are rows that have exactly the same values in all columns. They can cause problems like counting the same data twice or biasing results. Identifying duplicates helps clean data.
Result
You can spot exact repeated rows in a DataFrame.
Recognizing duplicates is the first step before focusing on specific columns.
3
Intermediate: Finding duplicates on all columns
🤔Before reading on: do you think pandas can find duplicates by default on all columns or only on some columns? Commit to your answer.
Concept: Learn how pandas identifies duplicate rows by default using all columns.
pandas has a method called duplicated() that returns True for rows that are duplicates of previous rows, considering all columns. For example:

import pandas as pd

data = {'A': ['x', 'y', 'x'], 'B': [1, 2, 1]}
df = pd.DataFrame(data)
df.duplicated()

This returns [False, False, True] because the third row repeats the first.
Result
You get a boolean series marking duplicate rows based on all columns.
Understanding the default behavior helps you realize why specifying columns is needed when only some columns matter.
4
Intermediate: Specifying columns to check duplicates
🤔Before reading on: do you think specifying columns to check duplicates will find rows duplicated only in those columns or still consider all columns? Commit to your answer.
Concept: Learn how to tell pandas to check duplicates only on certain columns using the subset parameter.
The duplicated() method accepts a subset argument where you list the columns to check. For example:

df.duplicated(subset=['A'])

This checks duplicates based only on column 'A'. Rows with the same 'A' value are marked as duplicates regardless of other columns.
Result
You get a boolean series marking duplicates based on specified columns only.
Knowing how to limit duplicate checks to specific columns lets you focus on important data parts and ignore irrelevant differences.
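To see the difference, compare a full-row check with a subset check on the same small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y", "x"], "B": [1, 2, 9]})

# Full-row check: no row repeats in every column, so nothing is marked
print(df.duplicated().tolist())              # [False, False, False]

# Subset check: only column A matters, so row 2 repeats row 0
print(df.duplicated(subset=["A"]).tolist())  # [False, False, True]
```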
5
Intermediate: Removing duplicates on specific columns
🤔Before reading on: do you think drop_duplicates() removes duplicates based on all columns by default or can it also use specific columns? Commit to your answer.
Concept: Learn how to remove duplicate rows based on specific columns using drop_duplicates().
drop_duplicates() removes duplicate rows. By default it considers all columns, but you can specify subset columns:

unique_df = df.drop_duplicates(subset=['A', 'B'])

This keeps only the first occurrence of each unique combination of 'A' and 'B'.
Result
You get a DataFrame with duplicates removed based on chosen columns.
Removing duplicates on specific columns cleans data while preserving unique records based on key fields.
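A small worked example (the email data is invented):

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "year":  [2021, 2022, 2023],
})

# Keep only the first row seen for each unique email
unique_df = df.drop_duplicates(subset=["email"])
print(unique_df["year"].tolist())  # [2021, 2022]
```

Note that the row with year 2023 is dropped even though it differs from row 0, because only the email column was checked.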
6
Advanced: Keeping last or all duplicates
🤔Before reading on: do you think drop_duplicates() can keep the last duplicate instead of the first? Commit to your answer.
Concept: Learn how to control which duplicate to keep using the keep parameter.
drop_duplicates() has a keep argument:
- 'first' (default) keeps the first occurrence
- 'last' keeps the last occurrence
- False drops all duplicates

Example:

unique_df = df.drop_duplicates(subset=['A'], keep='last')

This keeps the last row for each duplicate group in 'A'.
Result
You control which duplicates remain after removal.
Knowing how to keep different duplicates helps when the latest or earliest record matters in your analysis.
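All three keep options on one tiny frame (values chosen so the kept rows are obvious):

```python
import pandas as pd

df = pd.DataFrame({"A": ["x", "y", "x"], "B": [1, 2, 3]})

print(df.drop_duplicates(subset=["A"], keep="first")["B"].tolist())  # [1, 2]
print(df.drop_duplicates(subset=["A"], keep="last")["B"].tolist())   # [2, 3]
print(df.drop_duplicates(subset=["A"], keep=False)["B"].tolist())    # [2]
```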
7
Expert: Performance and memory considerations
🤔Before reading on: do you think checking duplicates on large DataFrames is always fast and memory efficient? Commit to your answer.
Concept: Understand the internal performance aspects and memory use when finding duplicates on specific columns.
pandas uses hash tables internally to find duplicates. Checking duplicates on many columns or on large data can be slow and use a lot of memory. Selecting fewer columns reduces the work, and categorical data types speed up comparisons. For very large data, consider chunking or specialized tools.
Result
You understand when duplicate detection might slow down or use too much memory.
Knowing performance limits helps you write efficient data cleaning code and avoid crashes or long waits.
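One way to see the categorical effect (a sketch; the column name and sizes are arbitrary):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
df = pd.DataFrame({"city": rng.choice(["NY", "LA", "SF"], size=n)})

# Categorical columns compare integer codes instead of strings,
# which is faster and uses less memory for repetitive text
df["city"] = df["city"].astype("category")

mask = df.duplicated(subset=["city"])
print(int(mask.sum()))  # n - 3: only the first NY, LA, and SF rows are unique
```

In practice, time the check before and after the astype call to measure the gain on your own data.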
Under the Hood
pandas finds duplicates by hashing the values of the specified columns for each row. It then compares these hashes to detect repeated values. Internally, it uses efficient algorithms to avoid comparing every row with every other row. The subset parameter tells pandas which columns to include in the hash calculation. The keep parameter controls which duplicate to mark or drop.
Why designed this way?
This design balances speed and flexibility. Hashing allows quick detection without full pairwise comparisons. Allowing subset columns lets users focus on relevant data parts. The keep parameter supports common use cases like keeping first or last entries. Alternatives like full row comparison would be slower and less flexible.
DataFrame rows
┌─────────────┐
│ Row 0       │
│ Columns A,B │
│ Values      │
└─────┬───────┘
      │ hash(A,B)
      ▼
┌─────────────┐
│ Hash Table  │
│ Stores seen │
│ hashes      │
└─────┬───────┘
      │
      ▼
Check if hash exists → Mark duplicate or unique

Subset columns chosen → Hash only those columns
Keep parameter → Decide which duplicate to keep
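The flow above can be imitated in a few lines of plain Python (a teaching sketch, not pandas' actual implementation):

```python
def duplicated_subset(rows, subset):
    """Mark each row True if its subset values were already seen."""
    seen = set()
    flags = []
    for row in rows:
        key = tuple(row[col] for col in subset)  # hashable key from the chosen columns
        flags.append(key in seen)                # duplicate if the key was seen before
        seen.add(key)
    return flags

rows = [
    {"A": "apple", "B": "red", "C": 10},
    {"A": "banana", "B": "yellow", "C": 5},
    {"A": "apple", "B": "red", "C": 7},
]
print(duplicated_subset(rows, ["A", "B"]))  # [False, False, True]
```

This mirrors keep='first': the first occurrence is marked unique and later ones duplicate.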
Myth Busters - 4 Common Misconceptions
Quick: Does duplicated() remove duplicates or just mark them? Commit to yes or no.
Common Belief: duplicated() removes duplicate rows from the DataFrame.
Reality: duplicated() only returns a boolean series marking duplicates; it does not remove them. You must use drop_duplicates() to remove duplicates.
Why it matters: Confusing these causes bugs where duplicates remain unexpectedly, leading to incorrect data analysis.
Quick: If you specify subset=['A'], does duplicated() consider other columns? Commit to yes or no.
Common Belief: Specifying subset=['A'] still checks all columns for duplicates.
Reality: Specifying subset=['A'] makes duplicated() check duplicates only based on column 'A', ignoring other columns.
Why it matters: Misunderstanding this leads to missing duplicates or removing the wrong rows.
Quick: Does drop_duplicates(keep=False) keep one duplicate or remove all duplicates? Commit to your answer.
Common Belief: drop_duplicates(keep=False) keeps one occurrence of duplicates.
Reality: drop_duplicates(keep=False) removes all rows that have duplicates, leaving only rows with no repeats.
Why it matters: Using keep=False unintentionally can delete more data than expected, causing data loss.
Quick: Are duplicates always exact matches across all columns? Commit to yes or no.
Common Belief: Duplicates mean rows are exactly the same in every column.
Reality: Duplicates can be defined on specific columns only, ignoring differences in other columns.
Why it matters: Assuming duplicates require full row equality can miss important repeated data patterns.
Expert Zone
1
When checking duplicates on columns with missing values, pandas treats NaNs as equal, so rows with NaNs in those columns can be duplicates.
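A quick demonstration of the NaN behavior:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"A": [np.nan, np.nan, 1.0]})

# np.nan == np.nan is False, yet duplicated() treats the two NaNs as equal
print(df.duplicated(subset=["A"]).tolist())  # [False, True, False]
```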
2
Using categorical data types for columns speeds up duplicate detection because comparisons use integer codes instead of strings.
3
drop_duplicates() preserves the original row order by default, which can be important for time series or ordered data.
When NOT to use
Avoid using duplicated() or drop_duplicates() on extremely large datasets without sampling or chunking, as memory and speed can become bottlenecks. Instead, use database queries with SQL DISTINCT or specialized big data tools like Spark.
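When the data must stay in pandas, one chunked approach looks like this (a sketch; the function name and parameters are placeholders, not a standard API):

```python
import pandas as pd

def drop_duplicates_chunked(path, key_column, chunksize=100_000):
    """Deduplicate a large CSV on one column without loading it all at once."""
    seen = set()
    pieces = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        chunk = chunk[~chunk[key_column].isin(seen)]        # drop keys seen in earlier chunks
        chunk = chunk.drop_duplicates(subset=[key_column])  # drop repeats within this chunk
        seen.update(chunk[key_column])
        pieces.append(chunk)
    return pd.concat(pieces, ignore_index=True)
```

The seen set still grows with the number of unique keys, so this helps only when duplicates are common; otherwise SQL DISTINCT or Spark is the better tool, as noted above.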
Production Patterns
In production, duplicates on specific columns are often checked during data ingestion to prevent bad data from entering systems. Automated pipelines use drop_duplicates() with subset columns to enforce uniqueness constraints. Also, duplicates detection is combined with logging and alerting to monitor data quality.
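A minimal ingestion-time check in that spirit (the function and logger names are illustrative, not a standard pattern):

```python
import logging
import pandas as pd

logger = logging.getLogger("ingest")

def enforce_unique(df, key_columns):
    """Drop duplicate keys at ingestion and log how many rows were removed."""
    n_dupes = int(df.duplicated(subset=key_columns).sum())
    if n_dupes:
        logger.warning("dropping %d duplicate rows on %s", n_dupes, key_columns)
    return df.drop_duplicates(subset=key_columns, keep="first")

orders = pd.DataFrame({"order_id": [1, 1, 2], "amount": [10, 10, 20]})
clean = enforce_unique(orders, ["order_id"])
print(clean["order_id"].tolist())  # [1, 2]
```

Wiring the warning into an alerting system then gives an early signal when upstream data quality degrades.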
Connections
Data Cleaning
Duplicates detection is a core step in data cleaning.
Understanding duplicates on specific columns helps build reliable data cleaning pipelines that improve overall data quality.
Database Unique Constraints
Duplicates on specific columns relate to unique constraints in databases.
Knowing how duplicates work in pandas helps understand how databases enforce uniqueness and avoid duplicate records.
Set Theory
Duplicates detection is related to the concept of sets and uniqueness in mathematics.
Recognizing duplicates is like finding repeated elements in a set, which deepens understanding of data uniqueness and filtering.
Common Pitfalls
#1 Assuming duplicated() removes duplicates automatically.
Wrong approach: df.duplicated(subset=['A'])  # Expecting duplicates to be removed here
Correct approach: df = df.drop_duplicates(subset=['A'])  # Removes duplicates based on column A
Root cause: Confusing duplicated(), which marks duplicates, with drop_duplicates(), which removes them.
#2 Not specifying subset when only some columns matter.
Wrong approach: df.drop_duplicates()  # Removes duplicates based on all columns, missing duplicates on key columns
Correct approach: df.drop_duplicates(subset=['key_column'])  # Removes duplicates based on key_column only
Root cause: Assuming the default behavior checks only the important columns instead of all columns.
#3 Using keep=False without understanding its effect.
Wrong approach: df.drop_duplicates(subset=['A'], keep=False)  # Removes all duplicates, including first occurrences
Correct approach: df.drop_duplicates(subset=['A'], keep='first')  # Keeps first occurrence, removes later duplicates
Root cause: Not realizing that keep=False removes every duplicated row, not just the extra copies.
Key Takeaways
Duplicates on specific columns means checking repeated values only in chosen columns, not the whole row.
pandas duplicated() marks duplicates, while drop_duplicates() removes them; both accept a subset of columns.
Specifying subset columns focuses duplicate detection on important data parts, improving cleaning accuracy.
The keep parameter controls which duplicate to keep, supporting different data cleaning needs.
Performance can slow on large data or many columns; using categorical types and limiting columns helps.