Challenge - 5 Problems

Train-Test Split Master

🧠 Conceptual

intermediate

Why do we use a train-test split in machine learning?

Choose the best reason why splitting data into training and testing sets is important.

A. To evaluate how well the model performs on unseen data.

B. To increase the size of the training data for better learning.

C. To reduce the number of features in the dataset.

D. To speed up the training process by using less data.

❓ Predict Output

intermediate

Output of train-test split sizes

What will be the output sizes of training and testing sets after this code runs?

ML Python

from sklearn.model_selection import train_test_split
X = list(range(100))
y = [x * 2 for x in X]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(len(X_train), len(X_test))

A. 75 25

B. 25 75

C. 70 30

D. 80 20
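
The sketch below illustrates how a float test_size maps to sample counts. It uses a hypothetical 8-sample toy dataset (not the data from the question) and an arbitrary random_state, so treat it as an illustration of the mechanism rather than a worked answer.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples, deliberately different from the question.
X = list(range(8))
y = [x * 2 for x in X]

# A float test_size is read as the fraction of samples held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

n_train, n_test = len(X_train), len(X_test)
print(n_train, n_test)  # 6 2
```

Every sample ends up in exactly one of the two sets, so the two sizes always add up to the original length.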

❓ Hyperparameter

advanced

Choosing the test_size parameter

Which test_size value is best if you want to maximize training data but still have a reliable test set?

A. 0.05

B. 0.2

C. 0.5

D. 0.8
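
To reason about the trade-off, it can help to look at how many samples each option leaves for training and how noisy an accuracy estimate on the test set would be. The sketch below is plain Python under stated assumptions: a hypothetical dataset of 1000 samples, an assumed true accuracy of 0.9, and the binomial standard error as a rough measure of test-set reliability. The helpers split_sizes and accuracy_std_error are illustrative names, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test count is rounded up, which matches how
    train_test_split treats a float test_size as far as I know.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


def accuracy_std_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_test)


n_samples = 1000          # hypothetical dataset size
assumed_accuracy = 0.9    # hypothetical true accuracy of the model

for test_size in (0.05, 0.2, 0.5, 0.8):
    n_train, n_test = split_sizes(n_samples, test_size)
    se = accuracy_std_error(assumed_accuracy, n_test)
    print(f"test_size={test_size}: train={n_train}, test={n_test}, std error ~ {se:.3f}")
```

A smaller test set leaves more data for training but gives a noisier estimate; a larger one does the opposite. The right balance also depends on the total dataset size, which this sketch fixes arbitrarily.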

🔧 Debug

advanced

Identify the error in this train-test split code

What error will this code raise?

ML Python

from sklearn.model_selection import train_test_split
X = [1, 2, 3]
y = [4, 5]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

A. No error, code runs successfully

B. TypeError: test_size must be an integer or float

C. SyntaxError: invalid syntax

D. ValueError: Found input variables with inconsistent numbers of samples
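
One way to check a guess after attempting the question is to run the call inside a try/except and report what, if anything, was raised. The helper below, describe_split_error, is a hypothetical name introduced for this sketch; the example call uses valid toy inputs, so you can substitute the lists from the question yourself.

```python
from sklearn.model_selection import train_test_split


def describe_split_error(X, y, **kwargs):
    """Return the name of the exception raised by train_test_split, or None."""
    try:
        train_test_split(X, y, **kwargs)
    except Exception as exc:  # broad on purpose: we only want to report the type
        return type(exc).__name__
    return None


# Hypothetical, well-formed inputs: equal lengths, so no exception is expected.
result = describe_split_error([1, 2, 3, 4], [0, 1, 0, 1], test_size=0.5)
print(result)  # None
```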

❓ Model Choice

expert

Best practice for train-test split with imbalanced classes

You have a classification dataset with very imbalanced classes. Which train-test split approach is best to keep class proportions consistent?

A. Split the data manually by slicing the dataset in order.

B. Randomly split the data without stratifying, to avoid bias.

C. Use train_test_split with the stratify parameter set to the target labels.

D. Use only training data and skip testing to avoid imbalance issues.
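
To see what the stratify parameter of train_test_split does, the sketch below counts class labels in each split. The data is hypothetical (100 samples with a 90/10 class imbalance) and the random_state is arbitrary; it is meant as a small illustration, not a recommendation for any particular dataset.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 10 positives, 90 negatives.
X = list(range(100))
y = [1] * 10 + [0] * 90

# stratify=y asks the splitter to preserve the class ratio of y in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print("train:", dict(train_counts))
print("test:", dict(test_counts))
```

With these numbers both splits keep the 9:1 ratio of the full dataset. Dropping the stratify argument leaves the minority count in the test set up to chance.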
