ML Python programming · ~20 mins

Why proper evaluation prevents overfitting in ML Python - Challenge Your Understanding

Challenge - 5 Problems
🎖️
Overfitting Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
🧠 Conceptual · intermediate · 2:00 time limit
Why do we use a separate test set in machine learning?

Imagine you train a model on some data and then check how well it works on the same data. Why is it important to use a different set of data (test set) to evaluate the model?

A. Because the test set helps us see how the model performs on new, unseen data, preventing us from thinking it works better than it really does.
B. Because the test set is used to train the model faster by giving it extra examples.
C. Because the test set contains only easy examples that make the model look good.
D. Because the test set is used to tune the model’s parameters during training.
Attempts: 2 left
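To see why checking a model on its own training data is misleading, here is a minimal sketch. The dataset, classifier, and random_state=0 are illustrative choices, not taken from the challenge:

```python
# An unconstrained decision tree can memorize its training rows,
# so only a held-out test set shows how it does on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Accuracy on the data the model was fit to: optimistic.
train_acc = accuracy_score(y_train, model.predict(X_train))
# Accuracy on held-out data: the honest estimate.
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Train: {train_acc:.2f}  Test: {test_acc:.2f}")
```

The training score alone says almost nothing about generalization; the held-out score is the one to trust.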
Predict Output · intermediate · 2:00 time limit
What is the output of this model evaluation code?

Given the following code that splits data and evaluates a model, what will be printed?

ML Python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
predictions_train = model.predict(X_train)
predictions_test = model.predict(X_test)
print(f"Train accuracy: {accuracy_score(y_train, predictions_train):.2f}")
print(f"Test accuracy: {accuracy_score(y_test, predictions_test):.2f}")
A. Train accuracy: 1.00, Test accuracy: 0.98
B. Train accuracy: 0.98, Test accuracy: 1.00
C. Train accuracy: 0.50, Test accuracy: 0.50
D. Train accuracy: 0.85, Test accuracy: 0.85
Attempts: 2 left
Metrics · advanced · 2:00 time limit
Which metric best detects overfitting in classification?

You train a classifier and get very high accuracy on training data but much lower accuracy on test data. Which metric helps you best understand this overfitting problem?

A. Recall on test data only
B. Precision on training data only
C. F1 score on training data only
D. Difference between training accuracy and test accuracy
Attempts: 2 left
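The difference between training and test accuracy can be computed directly. A minimal sketch, with the dataset, model, and random_state=7 as illustrative assumptions:

```python
# The train-test accuracy gap is a simple overfitting signal:
# a large positive gap means the model fits the training data
# much better than unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7
)

model = DecisionTreeClassifier(random_state=7).fit(X_train, y_train)
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))

gap = train_acc - test_acc
print(f"Train: {train_acc:.2f}  Test: {test_acc:.2f}  Gap: {gap:.2f}")
```

A single metric on one split (recall, precision, or F1 on training data alone) cannot reveal the problem; only comparing the two splits can.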
Model Choice · advanced · 2:00 time limit
Which model choice helps reduce overfitting?

You want to prevent overfitting on a small dataset. Which model choice is best?

A. Use the training data as the test data
B. Use a simpler model with fewer parameters
C. Use no validation and train longer
D. Use a very deep neural network with many layers
Attempts: 2 left
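One way to see the simpler-model idea in action is to compare an unconstrained tree with a depth-limited one on a small, noisy synthetic dataset. The make_classification parameters and max_depth=3 below are illustrative assumptions, not values from the challenge:

```python
# A simpler model (fewer effective parameters) usually shows a
# smaller train-test gap on a small, noisy dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

for name, m in [("deep tree", deep), ("shallow tree (max_depth=3)", shallow)]:
    print(f"{name}: train={m.score(X_train, y_train):.2f}  "
          f"test={m.score(X_test, y_test):.2f}")
```

The deep tree memorizes the noisy training labels (train accuracy 1.00), while the depth-limited tree typically shows a smaller gap between its two scores.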
🔧 Debug · expert · 2:00 time limit
Why does this evaluation code give misleading results?

Look at this code snippet. It trains and evaluates a model but the evaluation results are misleading. Why?

ML Python
from sklearn.datasets import load_digits
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

data = load_digits()
X, y = data.data, data.target
model = DecisionTreeClassifier()
model.fit(X, y)
predictions = model.predict(X)
print(f"Accuracy: {accuracy_score(y, predictions):.2f}")
A. Because accuracy_score requires test data, not training data.
B. Because the model was not trained at all before prediction.
C. Because the model is evaluated on the same data it was trained on, causing overfitting to be hidden.
D. Because the dataset is too small to train any model.
Attempts: 2 left
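A minimal sketch of one possible fix for the snippet above: split before fitting, then evaluate on rows the model never saw. The test_size=0.25 and random_state=0 values are illustrative choices:

```python
# Fixed evaluation: hold out a test set so memorization is exposed
# instead of hidden.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)  # fit on the training rows only

# Report both: a large gap between them is the overfitting warning sign.
print(f"Train accuracy: {accuracy_score(y_train, model.predict(X_train)):.2f}")
print(f"Test accuracy:  {accuracy_score(y_test, model.predict(X_test)):.2f}")
```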