
Duplicates on specific columns in Pandas - Practice Problems & Coding Challenges

Challenge - 5 Problems
Predict Output
intermediate
Output of drop_duplicates on specific columns

What is the output DataFrame after running the code below?

Pandas
import pandas as pd

data = {'A': [1, 2, 2, 3, 4], 'B': [5, 6, 6, 7, 8], 'C': [9, 10, 11, 12, 13]}
df = pd.DataFrame(data)
result = df.drop_duplicates(subset=['A', 'B'])
print(result)
A. {'A': [1, 2, 2, 3, 4], 'B': [5, 6, 6, 7, 8], 'C': [9, 10, 11, 12, 13]}
B. {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 12, 13]}
C. {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 11, 12, 13]}
D. {'A': [1, 2, 2, 3], 'B': [5, 6, 6, 7], 'C': [9, 10, 11, 12]}
💡 Hint

By default, drop_duplicates keeps the first row of each group of rows that match on the subset columns and drops the rest.
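
To illustrate the hint without touching the exercise data, here is a minimal sketch on a made-up table. The DataFrame `orders` and its column names are hypothetical; only `drop_duplicates` and its `subset` parameter come from the exercise.

```python
import pandas as pd

# Hypothetical order data, not the DataFrame from the exercise.
orders = pd.DataFrame({
    "customer": ["ann", "bob", "bob", "cat"],
    "item": ["pen", "ink", "ink", "pen"],
    "qty": [1, 2, 5, 3],
})

# Rows 1 and 2 share the same (customer, item) pair.
# With the default keep='first', row 1 is kept and row 2 is dropped,
# even though their qty values differ.
deduped = orders.drop_duplicates(subset=["customer", "item"])

print(deduped)
print(deduped.index.tolist())  # [0, 1, 3]
```

Columns outside the subset (here `qty`) play no part in the comparison; they simply travel with whichever row is kept. The original index labels are preserved, so a gap appears where a row was dropped.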

Data Output
intermediate
Count of duplicate rows on specific columns

How many rows does the code below flag as duplicates based on columns 'X' and 'Y'?

Pandas
import pandas as pd

data = {'X': [1, 2, 2, 3, 3, 3], 'Y': ['a', 'b', 'b', 'c', 'c', 'd'], 'Z': [10, 20, 20, 30, 30, 40]}
df = pd.DataFrame(data)
duplicates = df.duplicated(subset=['X', 'Y'])
count = duplicates.sum()
print(count)
A. 1
B. 3
C. 4
D. 2
💡 Hint

By default, duplicated() marks every occurrence after the first as a duplicate, so the first row of each group is not counted.
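
The counting rule in the hint can be sketched on a separate, made-up table. The DataFrame `readings` and its columns are hypothetical and chosen so the result differs from the exercise.

```python
import pandas as pd

# Hypothetical sensor readings, not the DataFrame from the exercise.
readings = pd.DataFrame({
    "sensor": ["s1", "s1", "s1", "s2", "s2"],
    "day": ["mon", "mon", "mon", "mon", "tue"],
    "value": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# duplicated() returns a boolean Series aligned with the rows.
# The first ("s1", "mon") row is False; the two repeats are True.
mask = readings.duplicated(subset=["sensor", "day"])
print(mask.tolist())  # [False, True, True, False, False]

# True counts as 1 when summed, so this is the number of flagged rows.
extra_rows = int(mask.sum())
print(extra_rows)  # 2
```

A group of three matching rows contributes two to the count, not three, because its first row is treated as the original.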

🔧 Debug
advanced
Identify the error in duplicate filtering code

What error will the following code produce?

Pandas
import pandas as pd

data = {'A': [1, 2, 2], 'B': [3, 4, 4]}
df = pd.DataFrame(data)
filtered = df[df.duplicated(columns=['A'])]
print(filtered)
A. No error, prints duplicate rows based on column 'A'
B. KeyError: 'columns'
C. TypeError: duplicated() got an unexpected keyword argument 'columns'
D. ValueError: subset columns not found
💡 Hint

Check the parameter name for specifying columns in duplicated().
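
One way to follow this hint is to inspect the method's signature rather than rely on memory. The sketch below uses the standard-library `inspect` module; the variable name `params` is only illustrative, and the exact list printed may vary between pandas versions.

```python
import inspect
import pandas as pd

# List the parameter names that DataFrame.duplicated actually accepts.
params = list(inspect.signature(pd.DataFrame.duplicated).parameters)
print(params)

# Check whether a given keyword is part of the signature.
print("subset" in params)
print("columns" in params)
```

Any keyword that is missing from this list is not accepted by the method, which is what the exercise asks you to reason about.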

🚀 Application
advanced
Filter DataFrame to keep only duplicates on columns 'M' and 'N'

Which code snippet correctly filters the DataFrame to keep only rows that have duplicates based on columns 'M' and 'N'?

Pandas
import pandas as pd

data = {'M': [1, 2, 2, 3, 3, 3], 'N': ['x', 'y', 'y', 'z', 'z', 'w'], 'O': [5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
A. df[df.duplicated(subset=['M', 'N'], keep=False)]
B. df[df.duplicated(subset=['M', 'N'])]
C. df.drop_duplicates(subset=['M', 'N'], keep=False)
D. df[df.duplicated(subset=['M', 'N'], keep='first')]
💡 Hint

Use keep=False to mark all duplicates, not just later ones.
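
The difference between the default mask and the keep=False mask can be seen on a small made-up table. The DataFrame `visits` and its columns are hypothetical and unrelated to the exercise data.

```python
import pandas as pd

# Hypothetical page-visit log, not the DataFrame from the exercise.
visits = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3"],
    "page": ["home", "home", "home", "cart"],
    "ms": [120, 95, 130, 210],
})

# Default keep='first': only the later repeat (row 2) is flagged.
later_only = visits.duplicated(subset=["user", "page"])
print(later_only.tolist())   # [False, False, True, False]

# keep=False: every row that belongs to a repeated group is flagged.
all_members = visits.duplicated(subset=["user", "page"], keep=False)
print(all_members.tolist())  # [True, False, True, False]

# Boolean indexing with the mask keeps exactly the flagged rows.
repeated = visits[all_members]
print(repeated)
```

With the default mask the first member of each repeated group is missing from the filtered result; with keep=False the whole group is returned.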

🧠 Conceptual
expert
Effect of keep parameter in drop_duplicates on specific columns

Consider a DataFrame with duplicate rows based on columns 'P' and 'Q'. What is the difference in output between drop_duplicates(subset=['P', 'Q'], keep='first') and drop_duplicates(subset=['P', 'Q'], keep='last')?

A. keep='first' keeps the first occurrence of each duplicate, keep='last' keeps the last occurrence; other duplicates are dropped accordingly.
B. keep='first' keeps duplicates with the smallest index, keep='last' raises an error.
C. keep='first' and keep='last' produce identical outputs.
D. keep='first' drops all duplicates, keep='last' keeps all duplicates.
💡 Hint

Think about which duplicate row is kept when using 'first' or 'last'.
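
A quick experiment makes the question concrete. The sketch below uses a made-up table; `prices` and its columns are hypothetical stand-ins for the 'P' and 'Q' columns in the question.

```python
import pandas as pd

# Hypothetical price list; rows 0 and 1 match on (sku, region).
prices = pd.DataFrame({
    "sku": ["a", "a", "b"],
    "region": ["eu", "eu", "eu"],
    "price": [10, 12, 7],
})

first = prices.drop_duplicates(subset=["sku", "region"], keep="first")
last = prices.drop_duplicates(subset=["sku", "region"], keep="last")

# Compare which original rows survive in each case.
print(first.index.tolist(), first["price"].tolist())  # [0, 2] [10, 7]
print(last.index.tolist(), last["price"].tolist())    # [1, 2] [12, 7]
```

Both results have the same number of rows and the same subset values. They differ only in which row of each matching group supplies the remaining columns, which matters whenever those columns disagree.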