Recall & Review
beginner
What does the pandas function
duplicated() do when used on a DataFrame?It returns a Boolean Series indicating whether each row is a duplicate of a previous row. By default, it checks all columns.
Click to reveal answer
beginner
How can you check for duplicates only on specific columns in a pandas DataFrame?
Use
df.duplicated(subset=[column_names]) where column_names is a list of columns to check for duplicates.Click to reveal answer
beginner
What does the parameter
keep='first' do in duplicated()?It marks all duplicates as True except for the first occurrence, which is marked as False.
Click to reveal answer
beginner
How do you remove duplicate rows based on specific columns in pandas?
Use
df.drop_duplicates(subset=[column_names]) to keep only the first occurrence of duplicates in those columns.Click to reveal answer
beginner
Why might you want to check duplicates on specific columns instead of the whole DataFrame?
Because sometimes only certain columns define the uniqueness of a row, and other columns may have different data but still represent the same entity.
Click to reveal answer
Which pandas method helps identify duplicate rows based on specific columns?
✗ Incorrect
The duplicated() method with the subset parameter checks duplicates on specific columns.
What does
df.duplicated(subset=['col1', 'col2'], keep='last') do?✗ Incorrect
The keep='last' parameter marks duplicates as True except for the last occurrence.
How do you remove duplicate rows based on columns 'A' and 'B' in pandas?
✗ Incorrect
drop_duplicates with subset=['A', 'B'] removes duplicates based on those columns.
If you want to find duplicates considering all columns, what should you do?
✗ Incorrect
By default, duplicated() checks duplicates on all columns if subset is not provided.
Why is it useful to specify columns in the subset parameter when checking duplicates?
✗ Incorrect
Specifying subset focuses duplicate detection on columns that matter for uniqueness.
Explain how to find and remove duplicate rows based on specific columns in a pandas DataFrame.
Think about how to tell pandas which columns to check for duplicates.
You got /3 concepts.
Why might checking duplicates on the entire DataFrame be less useful than checking on specific columns?
Consider real-life examples where only some details matter for identifying duplicates.
You got /3 concepts.