Challenge - 5 Problems

Train/Val/Test Split Master

❓ Predict Output

intermediate

Output of PyTorch dataset split code

What will be the output of the following code snippet that splits a dataset into train, validation, and test sets using PyTorch's random_split?

PyTorch

from torch.utils.data import random_split
from torch.utils.data import Dataset

class DummyDataset(Dataset):
    def __init__(self, length):
        self.data = list(range(length))
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

full_dataset = DummyDataset(100)
train_size = 70
val_size = 20
test_size = 10
train_set, val_set, test_set = random_split(full_dataset, [train_size, val_size, test_size])

print(len(train_set), len(val_set), len(test_set))

A. 33 33 34

B. 100 0 0

C. 70 20 10

D. 50 25 25
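A related point the snippet above leaves out is reproducibility: `random_split` shuffles indices before slicing them, so two runs can assign different items to each subset. The following is a minimal sketch, assuming a toy 10-item list (not the quiz dataset) and an arbitrary seed of 0, that passes a seeded `torch.Generator` and checks that the subsets partition the original indices.

```python
import torch
from torch.utils.data import random_split

# Toy stand-in for a dataset: anything with __len__ and __getitem__ works.
items = list(range(10))

# A dedicated, seeded generator makes the shuffle repeatable across runs.
# The seed value 0 is an arbitrary choice for this sketch.
generator = torch.Generator().manual_seed(0)
first, second, third = random_split(items, [6, 2, 2], generator=generator)

# Each Subset keeps the list of original indices it draws from.
all_indices = sorted(first.indices + second.indices + third.indices)
sizes = (len(first), len(second), len(third))

# The subsets are disjoint and together cover every original index.
print(sizes, all_indices == list(range(10)))
```

Fixing the seed matters mostly for the test subset: if the assignment changes between runs, items that were once held out can leak into training.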

🧠 Conceptual

intermediate

Purpose of validation set in train/val/test split

What is the main purpose of the validation set in a train/val/test split during machine learning model training?

A. To tune hyperparameters and prevent overfitting

B. To train the model with more data

C. To evaluate the model's performance on unseen data after training

D. To test the model's final accuracy before deployment

❓ Hyperparameter

advanced

Choosing split ratios for train/val/test

Which of the following split ratios is most appropriate for a dataset with 10,000 samples to ensure reliable training, validation, and testing?

A. 60% train, 20% validation, 20% test

B. 50% train, 25% validation, 25% test

C. 90% train, 5% validation, 5% test

D. 80% train, 10% validation, 10% test
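To compare the options concretely, it can help to turn each ratio into absolute sample counts. The sketch below assumes a hypothetical helper, `split_sizes`, that takes whole-number percentages and gives any rounding remainder to the training split; it only does the arithmetic and does not judge which ratio is preferable.

```python
def split_sizes(total, percents):
    """Convert whole-number percentages into integer sizes summing to total."""
    sizes = [total * p // 100 for p in percents]
    # Integer division can drop a few samples; give them to the training split.
    sizes[0] += total - sum(sizes)
    return sizes

# The four ratios listed in the question, as (train, validation, test) percentages.
options = {"A": (60, 20, 20), "B": (50, 25, 25), "C": (90, 5, 5), "D": (80, 10, 10)}

table = {name: split_sizes(10_000, percents) for name, percents in options.items()}
for name, (train, val, test) in table.items():
    print(f"{name}: train={train} val={val} test={test}")
```

Looking at counts rather than percentages makes the trade-off visible: every sample moved into validation or test is one fewer available for training.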

🔧 Debug

advanced

Identify error in PyTorch dataset splitting code

What error will the following code raise when trying to split a dataset into train, validation, and test sets?

PyTorch

from torch.utils.data import random_split
full_dataset = list(range(50))
train_set, val_set, test_set = random_split(full_dataset, [30, 15, 10])

A. TypeError: 'list' object has no attribute '__len__'

B. ValueError: Sum of input lengths does not equal the length of the input dataset

C. TypeError: random_split expects a Dataset object, not a list

D. No error, splits successfully
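When integer lengths are passed to `random_split`, they are expected to add up to `len(dataset)`. One defensive pattern is to check this yourself before splitting. The sketch below uses a hypothetical helper, `check_lengths`, which is not part of PyTorch, and deliberately uses different numbers from the question.

```python
def check_lengths(dataset, lengths):
    """Raise early, with the numbers spelled out, if lengths cannot cover dataset."""
    expected, requested = len(dataset), sum(lengths)
    if requested != expected:
        raise ValueError(
            f"split lengths sum to {requested}, but the dataset has {expected} items"
        )
    return lengths

items = list(range(8))

# A consistent request passes through unchanged.
ok = check_lengths(items, [5, 2, 1])

# An inconsistent request is rejected before any splitting happens.
try:
    check_lengths(items, [5, 2, 2])
    message = None
except ValueError as err:
    message = str(err)

print(ok, message)
```

A check like this is most useful when the lengths are computed from percentages, where rounding can silently leave the total one or two items off.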

❓ Model Choice

expert

Best approach to split highly imbalanced dataset

You have a highly imbalanced classification dataset with 1% positive and 99% negative samples. Which approach is best to split the dataset into train, validation, and test sets to maintain class distribution?

A. Use stratified splitting to keep class proportions in each subset

B. Manually shuffle and split the dataset randomly

C. Use random_split from PyTorch directly without stratification

D. Split only into train and test, skip validation
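For reference, stratified splitting can be sketched in plain Python: group indices by class, shuffle each group, and slice every group with the same ratios. The function name `stratified_split`, the 70/20/10 ratios, the seed, and the 1,000-sample label list are all assumptions for illustration. In practice a library routine such as scikit-learn's `train_test_split` with its `stratify` argument is a common choice.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.2, 0.1), seed=0):
    """Return one index list per fraction, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    splits = [[] for _ in fractions]
    for indices in by_class.values():
        rng.shuffle(indices)
        start = 0
        for i, fraction in enumerate(fractions):
            if i == len(fractions) - 1:
                # The last split takes whatever remains, so no index is dropped.
                end = len(indices)
            else:
                end = start + round(len(indices) * fraction)
            splits[i].extend(indices[start:end])
            start = end
    return splits

# Hypothetical data: 1% positive (label 1), 99% negative (label 0).
labels = [1] * 10 + [0] * 990
train_idx, val_idx, test_idx = stratified_split(labels)

sizes = [len(train_idx), len(val_idx), len(test_idx)]
positives = [sum(labels[i] for i in part) for part in (train_idx, val_idx, test_idx)]
print(sizes, positives)
```

The returned index lists could then be wrapped with `torch.utils.data.Subset` to obtain PyTorch datasets. With a purely random split of so few positives, a subset can easily end up with none at all, which is the failure this approach avoids.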
