Challenge - 5 Problems

Duplicate Detective

❓ Predict Output

intermediate

Counting duplicate rows in a DataFrame

What is the output of this code that counts duplicate rows in a pandas DataFrame?

Pandas

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'y', 'z', 'z', 'z']
})

count_duplicates = df.duplicated().sum()
print(count_duplicates)
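
As a hint, here is a minimal sketch of how `duplicated()` behaves with its default `keep='first'`. The `demo` frame is hypothetical and deliberately different from the one in the question.

```python
import pandas as pd

# Hypothetical frame, not the one from the question.
demo = pd.DataFrame({
    'A': [7, 7, 8, 9, 9],
    'B': ['p', 'p', 'q', 'r', 's'],
})

# duplicated() returns a boolean Series. With the default keep='first',
# a row is True only if an identical row appeared earlier in the frame.
mask = demo.duplicated()
print(mask.tolist())    # [False, True, False, False, False]

# Summing a boolean Series counts its True values.
print(int(mask.sum()))  # 1
```

Rows 3 and 4 share `A == 9` but differ in `B`, so neither is flagged: by default every column must match.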

❓ Data Output

intermediate

Counting duplicates with subset columns

Given this DataFrame, what does the code print when duplicates are counted based only on column 'A'?

Pandas

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3],
    'B': ['x', 'y', 'z', 'z', 'y', 'z']
})

count_dup_subset = df.duplicated(subset=['A']).sum()
print(count_dup_subset)
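
As a hint, the sketch below contrasts a full-row comparison with a `subset` comparison. The `demo` frame is hypothetical and differs from the one in the question.

```python
import pandas as pd

# Hypothetical frame, not the one from the question.
demo = pd.DataFrame({
    'A': [7, 7, 8, 9, 9],
    'B': ['p', 'q', 'q', 'r', 's'],
})

full_rows = demo.duplicated()         # compares every column
by_a = demo.duplicated(subset=['A'])  # compares column 'A' only

print(int(full_rows.sum()))  # 0: no two rows match in both columns
print(int(by_a.sum()))       # 2: one repeat of 7 and one repeat of 9
```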

🔧 Debug

advanced

Identify the error in counting duplicates

What error does this code raise when trying to count duplicates in a DataFrame?

Pandas

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': ['x', 'y', 'y']})

count = df.duplicated(subset='A', keep='maybe').sum()
print(count)

A. ValueError: keep must be one of {'first', 'last', False}

B. No error, prints 1

C. SyntaxError: invalid syntax

D. TypeError: duplicated() got an unexpected keyword argument 'keep'
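
For background, this sketch shows what the three documented values of `keep` do on a small hypothetical frame. It does not run the call from the question; `demo` and `results` are illustrative names.

```python
import pandas as pd

# Hypothetical single-column frame with one value repeated three times.
demo = pd.DataFrame({'A': [5, 5, 5, 6]})

results = {}
for keep in ('first', 'last', False):
    # 'first' spares the first copy, 'last' spares the last copy,
    # and False marks every member of a repeated group.
    results[keep] = demo.duplicated(subset='A', keep=keep).tolist()
    print(repr(keep), results[keep])

# 'first' [False, True, True, False]
# 'last'  [True, True, False, False]
# False   [True, True, True, False]
```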

🚀 Application

advanced

Count the distinct rows that are duplicated

How many distinct rows appear more than once in this DataFrame?

Pandas

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 2, 3, 3, 3, 4],
    'B': ['x', 'y', 'y', 'z', 'z', 'z', 'w']
})

# Count unique rows that have duplicates
counts = df.value_counts()
num_unique_duplicates = (counts > 1).sum()
print(num_unique_duplicates)
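
As a hint, the sketch below shows why counting repeated groups with `value_counts()` gives a different number from `duplicated().sum()`. The `demo` frame is hypothetical and differs from the one in the question; `DataFrame.value_counts` is assumed to be available (pandas 1.1 or later).

```python
import pandas as pd

# Hypothetical frame: three distinct rows repeat, one row is unique.
demo = pd.DataFrame({
    'A': [7, 7, 8, 8, 9, 9, 9, 10],
    'B': ['p', 'p', 'q', 'q', 'r', 'r', 'r', 's'],
})

# value_counts() on a DataFrame counts occurrences of each distinct row.
counts = demo.value_counts()

# Number of distinct rows that occur more than once.
repeated_groups = int((counts > 1).sum())
print(repeated_groups)  # 3

# Number of extra copies beyond the first occurrence of each row.
extra_copies = int(demo.duplicated().sum())
print(extra_copies)     # 4
```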

🧠 Conceptual

expert

Understanding duplicated() with keep=False

What is the output of this code that marks all duplicates including the first occurrence?

Pandas

import pandas as pd

df = pd.DataFrame({
    'A': [2, 2, 2, 3, 3, 3],
    'B': ['y', 'y', 'y', 'z', 'z', 'z']
})

duplicates_all = df.duplicated(keep=False)
print(duplicates_all.sum())
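
As a hint, a common use of the `keep=False` mask is to pull out every row that belongs to a repeated group, first occurrence included. The sketch below does this on a hypothetical `demo` frame that differs from the one in the question.

```python
import pandas as pd

# Hypothetical frame, not the one from the question.
demo = pd.DataFrame({
    'A': [4, 4, 5, 6, 6, 6],
    'B': ['a', 'a', 'b', 'c', 'c', 'c'],
})

# keep=False flags every row whose values occur more than once.
mask_all = demo.duplicated(keep=False)

# Boolean indexing keeps only the flagged rows.
all_copies = demo[mask_all]
print(all_copies.index.tolist())  # [0, 1, 3, 4, 5]
print(len(all_copies))            # 5

# The default keep='first' would flag fewer rows on the same frame.
print(int(demo.duplicated().sum()))  # 3
```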
