Challenge - 5 Problems

Duplicates Mastery

❓ Predict Output

intermediate

Output of drop_duplicates on specific columns

Which option lists the column values of the DataFrame printed by the code below?

Pandas

import pandas as pd

data = {'A': [1, 2, 2, 3, 4], 'B': [5, 6, 6, 7, 8], 'C': [9, 10, 11, 12, 13]}
df = pd.DataFrame(data)
result = df.drop_duplicates(subset=['A', 'B'])
print(result)

A. {'A': [1, 2, 2, 3, 4], 'B': [5, 6, 6, 7, 8], 'C': [9, 10, 11, 12, 13]}

B. {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 12, 13]}

C. {'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 11, 12, 13]}

D. {'A': [1, 2, 2, 3], 'B': [5, 6, 6, 7], 'C': [9, 10, 11, 12]}
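As a minimal sketch of how `subset` changes what counts as a duplicate, the example below uses a hypothetical toy frame (the names `toy`, `k`, and `v` are illustrative and not part of the challenge). It compares deduplicating on whole rows with deduplicating on one column only.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the challenge's DataFrame.
toy = pd.DataFrame({"k": ["u", "u", "v"], "v": [1, 2, 3]})

# Without subset, whole rows are compared; no two rows here are identical.
full = toy.drop_duplicates()

# With subset, only column "k" is compared. The default keep='first'
# retains the first row of each "k" value and preserves index labels.
by_key = toy.drop_duplicates(subset=["k"])

print(full)
print(by_key)
```

Note that the columns outside `subset` are ignored for the comparison but still appear in the result, carrying the values of whichever row was kept.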

❓ Data Output

intermediate

Count of duplicate rows on specific columns

What number does the code below print, that is, how many rows does duplicated() flag as repeats based on columns 'X' and 'Y'?

Pandas

import pandas as pd

data = {'X': [1, 2, 2, 3, 3, 3], 'Y': ['a', 'b', 'b', 'c', 'c', 'd'], 'Z': [10, 20, 20, 30, 30, 40]}
df = pd.DataFrame(data)
duplicates = df.duplicated(subset=['X', 'Y'])
count = duplicates.sum()
print(count)
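The counting pattern above relies on `duplicated()` returning a boolean Series, where `True` sums as 1. A small sketch on a hypothetical toy frame (`toy`, `k`, and `w` are illustrative names, not from the challenge) makes the mask visible before it is summed.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the challenge's DataFrame.
toy = pd.DataFrame({"k": ["u", "u", "u", "v"], "w": [1, 1, 2, 2]})

# With the default keep='first', the first row of each "k" value is
# marked False and every later repeat is marked True.
mask = toy.duplicated(subset=["k"])
print(mask.tolist())

# Summing a boolean Series counts its True entries.
n_repeats = int(mask.sum())
print(n_repeats)
```

The count therefore excludes the first occurrence of each repeated key; it measures how many rows would be removed by `drop_duplicates` with the same arguments.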

🔧 Debug

advanced

Identify the error in duplicate filtering code

What error will the following code produce?

Pandas

import pandas as pd

data = {'A': [1, 2, 2], 'B': [3, 4, 4]}
df = pd.DataFrame(data)
filtered = df[df.duplicated(columns=['A'])]
print(filtered)

A. No error, prints duplicate rows based on column 'A'

B. KeyError: 'columns'

C. TypeError: duplicated() got an unexpected keyword argument 'columns'

D. ValueError: subset columns not found
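When unsure which keyword arguments a pandas method accepts, one option is to inspect its signature rather than guess. The sketch below does that for `DataFrame.duplicated` and then applies the method to a hypothetical toy frame (`toy` and `k` are illustrative names, not from the challenge).

```python
import inspect

import pandas as pd

# List the parameter names that DataFrame.duplicated accepts.
params = list(inspect.signature(pd.DataFrame.duplicated).parameters)
print(params)

# Hypothetical toy data, unrelated to the challenge's DataFrame.
toy = pd.DataFrame({"k": ["u", "u", "v"]})

# Restrict the comparison to column "k" using a supported keyword.
flagged = toy[toy.duplicated(subset=["k"])]
print(flagged)
```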

🚀 Application

advanced

Filter DataFrame to keep only duplicates on columns 'M' and 'N'

Which code snippet correctly filters the DataFrame to keep every row whose combination of values in columns 'M' and 'N' appears more than once?

Pandas

import pandas as pd

data = {'M': [1, 2, 2, 3, 3, 3], 'N': ['x', 'y', 'y', 'z', 'z', 'w'], 'O': [5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

A. df[df.duplicated(subset=['M', 'N'], keep=False)]

B. df[df.duplicated(subset=['M', 'N'])]

C. df.drop_duplicates(subset=['M', 'N'], keep=False)

D. df[df.duplicated(subset=['M', 'N'], keep='first')]
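The options above differ mainly in the `keep` argument, so it may help to see the three masks that `duplicated()` can produce. This is a sketch on a hypothetical single-column toy frame (`toy` and `k` are illustrative names, not from the challenge).

```python
import pandas as pd

# Hypothetical toy data, unrelated to the challenge's DataFrame.
toy = pd.DataFrame({"k": ["u", "u", "v", "u"]})

# keep='first' (the default): the first occurrence is not flagged.
mask_first = toy.duplicated(subset=["k"], keep="first")

# keep='last': the last occurrence is not flagged.
mask_last = toy.duplicated(subset=["k"], keep="last")

# keep=False: every member of a repeated group is flagged.
mask_all = toy.duplicated(subset=["k"], keep=False)

print(mask_first.tolist())
print(mask_last.tolist())
print(mask_all.tolist())
```

Indexing a frame with one of these masks keeps the rows marked `True`, whereas `drop_duplicates` with the same arguments keeps the rows marked `False`.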

🧠 Conceptual

expert

Effect of keep parameter in drop_duplicates on specific columns

Consider a DataFrame with duplicate rows based on columns 'P' and 'Q'. What is the difference in output between drop_duplicates(subset=['P', 'Q'], keep='first') and drop_duplicates(subset=['P', 'Q'], keep='last')?

A. keep='first' keeps the first occurrence of each duplicated ('P', 'Q') combination and keep='last' keeps the last occurrence; the remaining occurrences are dropped.

B. keep='first' keeps the duplicates with the smallest index, and keep='last' raises an error.

C. keep='first' and keep='last' produce identical outputs.

D. keep='first' drops all duplicates, and keep='last' keeps all duplicates.
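To experiment with the `keep` parameter of `drop_duplicates`, a sketch like the one below can be used. It runs on a hypothetical toy frame (`toy`, `k`, and `v` are illustrative names, not the 'P' and 'Q' columns from the question) and records which index labels survive under each setting.

```python
import pandas as pd

# Hypothetical toy data: key "u" appears at index labels 0, 2 and 3.
toy = pd.DataFrame({"k": ["u", "v", "u", "u"], "v": [10, 20, 30, 40]})

# Deduplicate on "k" only, varying which occurrence is retained.
kept_first = toy.drop_duplicates(subset=["k"], keep="first")
kept_last = toy.drop_duplicates(subset=["k"], keep="last")
kept_none = toy.drop_duplicates(subset=["k"], keep=False)

# Surviving rows stay in their original order with original labels.
print(kept_first.index.tolist())
print(kept_last.index.tolist())
print(kept_none.index.tolist())
```

Because the non-subset column `v` differs between repeats, the choice of `keep` changes not only which index labels remain but also which `v` values end up in the result.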
