ML Python programming · ~20 mins

Stratified K-fold in ML Python - Practice Problems & Coding Challenges

Challenge - 5 Problems
🎖️ Stratified K-fold Master - get all challenges correct to earn this badge!
🧠 Conceptual · intermediate · 2:00
Why use Stratified K-fold instead of regular K-fold?

Imagine you have a dataset with two classes: 90% are class A and 10% are class B. You want to split the data into 5 folds for cross-validation.

Why is Stratified K-fold better than regular K-fold in this case?

A. It ensures each fold has approximately the same percentage of samples of each class as the whole dataset.
B. It randomly shuffles the data without considering class distribution, which is faster.
C. It creates folds with only one class each to simplify training.
D. It duplicates minority-class samples to balance the dataset before splitting.
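After answering, you can check the idea empirically. This sketch (a synthetic 90/10 label array; all names are illustrative) compares the class-B fraction in each fold under plain KFold and StratifiedKFold:

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)  # 90% class A (0), 10% class B (1)

for name, splitter in [("KFold", KFold(n_splits=5)),
                       ("StratifiedKFold", StratifiedKFold(n_splits=5))]:
    # fraction of class B in each fold's test set
    ratios = [y[test].mean() for _, test in splitter.split(X, y)]
    print(name, [round(r, 2) for r in ratios])
    # unshuffled KFold folds vary wildly; stratified folds are all ~0.10
```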
Predict Output · intermediate · 2:00
Output of StratifiedKFold split indices

Given this code, what is the output of the printed train indices for the first fold?

ML Python
from sklearn.model_selection import StratifiedKFold
import numpy as np

X = np.array([[i] for i in range(10)])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=2, shuffle=False)

for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    if fold == 0:
        print(train_index.tolist())
A. [2, 3, 7, 8, 9]
B. [0, 1, 2, 3, 4]
C. [5, 6, 7, 8, 9]
D. [0, 1, 2, 3, 4, 5]
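To check your prediction, run the snippet and print both index sets for each fold (a quick self-check, not part of the graded problem):

```python
from sklearn.model_selection import StratifiedKFold
import numpy as np

X = np.array([[i] for i in range(10)])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=2, shuffle=False)
# without shuffling, each class's samples are dealt to folds in order:
# class 0 -> {0,1} / {2,3}, class 1 -> {4,5,6} / {7,8,9}
for fold, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: TRAIN {train_index.tolist()} TEST {test_index.tolist()}")
```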
Model Choice · advanced · 2:00
Choosing the best cross-validation method for imbalanced data

You have a dataset with 95% of samples in class 0 and 5% in class 1. You want to evaluate a classification model's performance reliably.

Which cross-validation method is best to use?

A. Leave-One-Out cross-validation
B. Random train-test split without cross-validation
C. Stratified K-fold cross-validation
D. Regular K-fold cross-validation without stratification
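A minimal sketch of evaluating a model on a 95/5 imbalanced dataset with a stratified splitter (the synthetic data via make_classification and the logistic-regression model are illustrative choices, not part of the problem):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# synthetic binary data: ~95% class 0, ~5% class 1
X, y = make_classification(n_samples=400, weights=[0.95], flip_y=0,
                           random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
# every fold sees roughly the same 95/5 class mix, so the score
# estimate is not distorted by folds that lack the minority class
print(scores.mean())
```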
Hyperparameter · advanced · 2:00
Effect of increasing n_splits in StratifiedKFold

What is the effect of increasing the number of splits (n_splits) in StratifiedKFold on the training and validation sets?

A. Both training and validation sets remain the same size regardless of n_splits.
B. Training sets become smaller and validation sets become smaller, leading to overfitting.
C. Training sets become smaller and validation sets become larger, increasing variance in performance estimates.
D. Training sets become larger and validation sets become smaller, reducing bias in performance estimates.
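You can observe the size trade-off directly. This sketch (a toy 100-sample set with an 80/20 class split; numbers are illustrative) prints the train/validation sizes of the first fold as n_splits grows:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 80 + [1] * 20)

for k in (2, 5, 10):
    # look at the first fold only; all folds have (nearly) equal sizes
    train, test = next(iter(StratifiedKFold(n_splits=k).split(X, y)))
    print(f"n_splits={k}: train={len(train)}, validation={len(test)}")
```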
🔧 Debug · expert · 2:00
Why does this StratifiedKFold code raise an error?

Consider this code snippet:

from sklearn.model_selection import StratifiedKFold
import numpy as np

X = np.array([[i] for i in range(6)])
y = np.array([0, 0, 1, 1, 1, 1])

skf = StratifiedKFold(n_splits=3)

for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)

Running this code raises a ValueError. What is the cause?

A. The input arrays X and y have mismatched lengths.
B. The number of splits is greater than the number of members in the smallest class.
C. StratifiedKFold requires shuffle=True when n_splits > 2.
D. The labels y contain non-integer values, which StratifiedKFold cannot handle.