ML Python programming - ~20 mins

Why data preparation consumes most of the time in Python ML projects - Challenge Your Understanding

Challenge - 5 Problems
🧠 Conceptual - intermediate
Why does data cleaning take so much time in ML projects?

In machine learning, data cleaning is a major part of data preparation. Why does it usually take the most time?

A. Because raw data often has errors, missing values, and inconsistencies that need fixing before training.
B. Because training models requires a lot of computing power and time.
C. Because data cleaning involves writing complex algorithms to improve model accuracy.
D. Because data cleaning is automated and runs slowly on large datasets.
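The kinds of fixes this question describes can be sketched in a few lines. This is a minimal illustration, not a general-purpose cleaner; the records, field names, and cleaning rules are all hypothetical.

```python
# Hypothetical raw records showing the typical problems: a missing value,
# an impossible value, and inconsistent string formatting.
raw = [
    {"age": 34, "city": " berlin "},
    {"age": None, "city": "Berlin"},
    {"age": -5, "city": "BERLIN"},
]

def clean(records):
    """Drop impossible ages, impute missing ones, normalise city names."""
    valid_ages = [r["age"] for r in records
                  if r["age"] is not None and r["age"] >= 0]
    mean_age = sum(valid_ages) / len(valid_ages)
    cleaned = []
    for r in records:
        age = r["age"]
        if age is None:
            age = mean_age      # impute missing value with the mean
        elif age < 0:
            continue            # drop a row with an impossible value
        cleaned.append({"age": age, "city": r["city"].strip().title()})
    return cleaned

rows = clean(raw)
```

Each rule here had to be decided by a human looking at the data, which is exactly why cleaning dominates project time.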
🧠 Conceptual - intermediate
What is the main reason feature engineering is time-consuming?

Feature engineering is a key step in data preparation. Why does it often consume a lot of time?

A. Because it only works on small datasets and needs manual scaling.
B. Because it involves training multiple models to select features.
C. Because it requires domain knowledge to create meaningful features from raw data.
D. Because it is fully automated and requires tuning many parameters.
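A short sketch of what "domain knowledge" means in practice: an expert decides which derived features matter. The transaction records and feature names below are invented for illustration (e.g. the idea that late-night purchases might matter to a fraud model).

```python
from datetime import datetime

# Hypothetical raw transaction records.
transactions = [
    {"amount": 120.0, "timestamp": "2024-03-15T22:30:00"},
    {"amount": 15.5,  "timestamp": "2024-03-16T09:05:00"},
]

def engineer_features(record):
    """Domain knowledge drives each feature: for a fraud model, the hour
    of day and weekend flag may carry more signal than the raw timestamp."""
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        "amount": record["amount"],
        "hour": ts.hour,
        "is_night": int(ts.hour >= 22 or ts.hour < 6),
        "is_weekend": int(ts.weekday() >= 5),
    }

features = [engineer_features(t) for t in transactions]
```

None of these derived columns can be guessed by a generic tool; deciding which ones to build is the slow, human part.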
Metrics - advanced
Measuring data quality impact on model accuracy

You have two datasets: Dataset A is clean and Dataset B has 20% missing values. You train the same model on both. Which metric difference best shows the impact of data quality?

A. Dataset A precision: 0.5, Dataset B precision: 0.5
B. Dataset A loss: 0.2, Dataset B loss: 0.2
C. Dataset A training time: 10 minutes, Dataset B training time: 15 minutes
D. Dataset A accuracy: 90%, Dataset B accuracy: 70%
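The accuracy comparison in the question is just counting agreements between labels and predictions. A minimal sketch, with made-up prediction vectors chosen so the gap mirrors the 90% vs 70% figures above:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical predictions from the same model trained on each dataset.
y_true     = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred_clean = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # trained on clean Dataset A
pred_noisy = [1, 0, 0, 1, 1, 0, 0, 0, 1, 1]  # trained on Dataset B (20% missing)

acc_a = accuracy(y_true, pred_clean)  # 0.9
acc_b = accuracy(y_true, pred_noisy)  # 0.7
```

An accuracy gap on the same model isolates data quality as the variable; identical precision or loss values (options A and B) show no difference at all, and training time (option C) measures cost, not quality.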
🔧 Debug - advanced
Why does this data scaling code cause poor model results?

Consider this Python snippet for scaling features before training:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.fit_transform(X_test)

Why might this cause poor model performance?

A. Because the scaler is fit separately on test data, causing data leakage and inconsistent scaling.
B. Because StandardScaler cannot be used on numeric data.
C. Because scaling is not needed before training models.
D. Because fit_transform should be replaced with transform on training data.
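The inconsistency can be shown without sklearn at all, using a hand-rolled stand-in for StandardScaler on made-up numbers. The point: refitting on the test set puts the same raw values on a different scale than the training data saw.

```python
# A minimal stand-in for StandardScaler's statistics.
def fit(xs):
    """Compute mean and (population) standard deviation of a sample."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, var ** 0.5

def transform(xs, mean, std):
    """Standardise values using previously computed statistics."""
    return [(x - mean) / std for x in xs]

X_train = [0.0, 2.0, 4.0, 6.0]
X_test = [10.0, 12.0]

# Correct: reuse the training statistics on the test set.
mu, sigma = fit(X_train)
test_ok = transform(X_test, mu, sigma)      # large positive values, as expected

# Buggy (as in the snippet above): refit on the test set, so values that
# are far outside the training range look perfectly average.
mu_bad, sigma_bad = fit(X_test)
test_bad = transform(X_test, mu_bad, sigma_bad)  # [-1.0, 1.0]
```

With the refit scaler, test points at 10 and 12 map to -1 and +1, hiding the fact that they sit well outside the training distribution; that is the inconsistent scaling option A describes.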
Model Choice - expert
Choosing the best approach to handle missing data in a large dataset

You have a large dataset with 30% missing values scattered randomly. Which approach is best to prepare data for a machine learning model?

A. Remove all rows with missing values to keep only complete data.
B. Fill missing values with the mean or median of each feature before training.
C. Use a model that can handle missing values natively without imputation.
D. Replace missing values with zeros regardless of feature meaning.
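For reference, option B (per-feature median imputation) can be sketched in a few lines of pure Python; the tiny matrix below is hypothetical, and the 30% missingness in the question is not reproduced exactly.

```python
import statistics

# Hypothetical feature matrix with missing values marked as None.
X = [
    [1.0, None, 3.0],
    [2.0, 5.0, None],
    [None, 7.0, 9.0],
    [4.0, 9.0, 3.0],
]

def impute_median(matrix):
    """Fill each missing entry with the median of the observed values
    in its column."""
    cols = list(zip(*matrix))
    medians = [statistics.median(v for v in col if v is not None)
               for col in cols]
    return [
        [m if v is None else v for v, m in zip(row, medians)]
        for row in matrix
    ]

X_filled = impute_median(X)
```

Option A would be wasteful here: with 30% of values missing at random, dropping every incomplete row can discard most of the dataset, whereas imputation (or a model with native missing-value support) keeps all the observed information.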