What is the output of the following code that checks for duplicate rows in a DataFrame?
import pandas as pd

data = {'Name': ['Anna', 'Bob', 'Anna', 'David'], 'Age': [25, 30, 25, 40]}
df = pd.DataFrame(data)
result = df.duplicated()
print(result.tolist())
Look at how pandas marks the first occurrence of a duplicate as False and subsequent ones as True.
The duplicated() method marks the first occurrence of a row as False and any later duplicates as True. Here, the third row is a duplicate of the first.
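The keep parameter controls which occurrence is treated as the "original". A small sketch of the three settings, using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Anna', 'Bob', 'Anna', 'David'],
                   'Age': [25, 30, 25, 40]})

# Default keep='first': first occurrence is False, later duplicates are True
print(df.duplicated().tolist())             # [False, False, True, False]

# keep='last': the last occurrence counts as the original
print(df.duplicated(keep='last').tolist())  # [True, False, False, False]

# keep=False: every member of a duplicate group is flagged
print(df.duplicated(keep=False).tolist())   # [True, False, True, False]
```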
What does the DataFrame look like after removing duplicate rows?
import pandas as pd

data = {'City': ['NY', 'LA', 'NY', 'Chicago'], 'Population': [8000, 4000, 8000, 2700]}
df = pd.DataFrame(data)
clean_df = df.drop_duplicates()
print(clean_df.reset_index(drop=True))
Remember that drop_duplicates() keeps the first occurrence and removes later duplicates.
The duplicate row with City 'NY' and Population 8000 is removed, leaving only unique rows.
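By default drop_duplicates() compares entire rows. With the subset parameter you can restrict the comparison to chosen columns; here is a sketch with hypothetical data where the two 'NY' rows differ in Population but are still deduplicated by city:

```python
import pandas as pd

df = pd.DataFrame({'City': ['NY', 'LA', 'NY', 'Chicago'],
                   'Population': [8000, 4000, 8500, 2700]})

# subset= considers only the listed columns when deciding what counts
# as a duplicate, so the second 'NY' row is dropped even though its
# Population value differs
unique_cities = df.drop_duplicates(subset=['City']).reset_index(drop=True)
print(unique_cities)
```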
What is the main reason why duplicate data can lead to incorrect analysis results?
Think about how repeated data affects averages or totals.
Duplicates cause bias because they count the same information more than once, which can distort averages, sums, and other statistics.
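A quick illustration with hypothetical salary data, where one record was accidentally loaded twice and inflates the mean:

```python
import pandas as pd

# Hypothetical table: employee C's record appears twice
salaries = pd.DataFrame({'employee': ['A', 'B', 'C', 'C'],
                         'salary': [40000, 50000, 90000, 90000]})

# The duplicate counts C's high salary twice and biases the average upward
print(salaries['salary'].mean())                    # 67500.0

# After deduplication each employee contributes once
print(salaries.drop_duplicates()['salary'].mean())  # 60000.0
```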
What error will this code produce when trying to detect duplicates?
import pandas as pd

data = {'A': [1, 2, 2], 'B': [3, 4, 4]}
df = pd.DataFrame(data)
result = df.duplicated(axis=2)
print(result)
Check the signature of duplicated() — does it accept an axis parameter at all?
The duplicated() method takes only subset and keep parameters; it has no axis argument. Passing axis=2 therefore raises a TypeError (unexpected keyword argument 'axis').
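A small sketch confirming the failure mode by catching the exception:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 4]})

try:
    # duplicated() has no axis parameter, so this call never runs
    df.duplicated(axis=2)
except TypeError as e:
    print('TypeError:', e)

# The supported call works row-wise only: the third row repeats the second
print(df.duplicated().tolist())  # [False, False, True]
```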
How can duplicate records in a training dataset affect the performance of a machine learning model?
Consider how repeated examples influence the model's learning process.
Duplicates can cause overfitting because the model learns patterns too specifically from repeated data, reducing its ability to generalize.
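A common mitigation is to drop exact duplicates before training, so no example is implicitly weighted more than once. A minimal sketch with a hypothetical training table:

```python
import pandas as pd

# Hypothetical training data: the (2.0, 1) example appears twice
train = pd.DataFrame({'feature': [1.0, 2.0, 2.0, 3.0],
                      'label':   [0,   1,   1,   0]})

# Removing exact duplicates prevents the repeated row from being
# weighted more heavily than the other examples during fitting
train_unique = train.drop_duplicates().reset_index(drop=True)
print(len(train), '->', len(train_unique))  # 4 -> 3
```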