Recall & Review
beginner
What is training data preparation in machine learning?
Training data preparation is the process of cleaning, organizing, and formatting raw data so that a machine learning model can learn from it effectively.
beginner
Why do we need to clean data before training a model?
Cleaning data removes errors, missing values, and inconsistencies that could confuse the model and reduce its accuracy.
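As a small illustration of the cleaning described above, the sketch below uses pandas on a made-up table (the column names and values are hypothetical). It drops exact duplicate rows and fills a missing numeric value with the column median. Which fill strategy is appropriate depends on the data, so treat the median here as one assumption among several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: one duplicated row and one missing age.
raw = pd.DataFrame({
    "age": [25.0, 25.0, np.nan, 40.0],
    "city": ["Paris", "Paris", "Rome", "Oslo"],
})

# Remove exact duplicate rows, then reset the index.
clean = raw.drop_duplicates().reset_index(drop=True)

# Fill the missing age with the median of the remaining ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```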
intermediate
What is feature scaling and why is it important?
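Feature scaling adjusts the range of data features so they have similar scales. This matters most for models that rely on gradients or distances, where a feature with large values can otherwise dominate the others and slow down or distort learning.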
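To make the idea concrete, here is a minimal NumPy sketch of two common scaling methods, min-max scaling and standardization. The feature values are hypothetical and chosen only so the columns sit on very different scales.

```python
import numpy as np

# Hypothetical features on very different scales: age (years) and income.
X = np.array([[20.0, 30000.0],
              [30.0, 60000.0],
              [40.0, 90000.0]])

# Min-max scaling: map each column to the range [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: give each column zero mean and unit variance.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_standard)
```

After either transform, both columns cover a comparable range, so neither dominates purely because of its units.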
beginner
Explain the difference between training, validation, and test data.
Training data is used to teach the model. Validation data helps tune the model’s settings. Test data checks how well the model works on new, unseen data.
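One way to obtain all three sets is to call scikit-learn's train_test_split twice. The sketch below assumes a 60/20/20 split and a made-up dataset; the proportions are an illustrative choice, not a rule.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 100 samples, 3 features, binary labels.
X = np.arange(300).reshape(100, 3)
y = np.arange(100) % 2

# First hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then take 25% of the remaining 80% as validation (20% of the total).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```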
intermediate
What is data augmentation and when is it used?
Data augmentation creates new training examples by modifying existing data, like flipping images. It is used to increase data size and improve model robustness.
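The image-flipping example can be sketched in a few lines of NumPy. The tiny array below is a hypothetical stand-in for a real image; real pipelines usually apply such transforms randomly and on the fly.

```python
import numpy as np

# Hypothetical 2x3 grayscale "image".
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Horizontal flip: reverse the column order.
flipped = np.fliplr(image)

# The augmented training set holds the original and the flipped copy.
augmented = np.stack([image, flipped])

print(flipped)
print(augmented.shape)
```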
Which step is NOT part of training data preparation?
A. Cleaning missing values
B. Splitting data into sets
C. Training the model
D. Scaling features
Training the model happens after data preparation, not during it.
Why do we split data into training, validation, and test sets?
A. To evaluate model performance fairly
B. To remove errors from data
C. To make the dataset smaller
D. To speed up data cleaning
Splitting data helps test how well the model works on new data and tune it properly.
What does feature scaling do?
A. Adds new data points
B. Removes missing data
C. Splits data into groups
D. Changes data to a similar range
Feature scaling adjusts data values to a similar range for better model learning.
Data augmentation is mainly used to:
A. Create more training examples
B. Clean data errors
C. Split data into sets
D. Scale features
Data augmentation increases the size of training data by creating new examples.
Which of these is a common data cleaning task?
A. Normalizing features
B. Removing duplicates
C. Splitting data
D. Training the model
Removing duplicate records is a common cleaning step to improve data quality.
Describe the key steps involved in preparing training data for a machine learning model.
Think about what you do to raw data before feeding it to a model.
Explain why splitting data into training, validation, and test sets is important.
Consider how you check if a model works well on new data.
Practice
1. What is the main purpose of training data preparation in machine learning?
easy
A. To clean and organize data for better model learning
B. To create the final model architecture
C. To deploy the model to production
D. To write the code for model training
Solution
Step 1: Understand the role of training data preparation
Training data preparation involves cleaning and organizing data so the model can learn effectively.
Step 2: Differentiate from other steps in machine learning
Creating model architecture, deployment, and coding are separate steps after data preparation.
Final Answer:
To clean and organize data for better model learning -> Option A
Quick Check:
Training data preparation = cleaning and organizing data [OK]
Hint: Focus on data cleaning and organizing for training
Common Mistakes:
Confusing data preparation with model building
Thinking deployment is part of data preparation
Assuming coding is data preparation
2. Which of the following is the correct way to split data into training and testing sets in Python using scikit-learn?
easy
A. split_train_test(data, 0.2)
B. train_test(data, split=0.2)
C. train_test_split(data, test_size=0.2)
D. test_train_split(data, size=0.2)
Solution
Step 1: Recall the scikit-learn function for splitting data
The correct function is train_test_split from sklearn.model_selection, which takes parameters such as test_size.
Step 2: Check the syntax of each option
Only train_test_split(data, test_size=0.2) uses the correct function name and parameter syntax.
Final Answer:
train_test_split(data, test_size=0.2) -> Option C
Quick Check:
Correct function and parameter = train_test_split(data, test_size=0.2) [OK]
Hint: Remember scikit-learn's train_test_split function name
Common Mistakes:
Using wrong function names
Incorrect parameter names
Mixing order of parameters
3. Given the code below, what will be the output of print(X_train.shape, X_test.shape)?
medium
from sklearn.model_selection import train_test_split
import numpy as np
X = np.arange(20).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)
A. (7, 2) (3, 2)
B. (3, 2) (7, 2)
C. (10, 2) (0, 2)
D. (5, 2) (5, 2)
Solution
Step 1: Understand the data shape and split ratio
The data X has 10 rows and 2 columns. test_size=0.3 means 30% of the data is held out for testing (3 rows) and the remaining 70% is used for training (7 rows).
Step 2: Calculate the shapes of training and testing sets
Training set shape: (7, 2), Testing set shape: (3, 2).
Final Answer:
(7, 2) (3, 2) -> Option A
Quick Check:
70% train = 7 rows, 30% test = 3 rows [OK]
Hint: Calculate rows by multiplying total by split ratio
Common Mistakes:
Swapping train and test sizes
Ignoring the shape's second dimension
Misunderstanding test_size meaning
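When the multiplication does not give a whole number, scikit-learn rounds the test set size up and gives the remaining rows to the training set. The sketch below reuses the same array with a hypothetical test_size of 0.25 to show this case.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)

# 10 * 0.25 = 2.5 rows is not a whole number, so the test set gets the ceiling.
X_train, X_test = train_test_split(X, test_size=0.25, random_state=42)

print(X_train.shape, X_test.shape)
print(math.ceil(10 * 0.25))
```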
4. Identify the error in the following code snippet for normalizing data using MinMaxScaler:
B. MinMaxScaler requires data as a numpy array, not list
C. fit_transform should be called on scaler.fit(X).transform(X)
D. No error, code runs correctly
Solution
Step 1: Check input data type compatibility
MinMaxScaler accepts lists or numpy arrays as input, so list input is valid.
Step 2: Verify method usage
Calling scaler.fit_transform(X) is the correct way to fit and transform data in one step.
Final Answer:
No error, code runs correctly -> Option D
Quick Check:
MinMaxScaler works with lists and fit_transform method [OK]
Hint: MinMaxScaler accepts lists and arrays directly
Common Mistakes:
Thinking input must be numpy array
Misusing fit and transform methods
Assuming scaler rejects negative values
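The points in the solution can be checked with a short sketch. This is not the snippet from the question; the input values are hypothetical and include a negative number to show that MinMaxScaler handles plain lists and negative values.

```python
from sklearn.preprocessing import MinMaxScaler

# Hypothetical input given as a plain nested list, including a negative value.
X = [[-1.0, 2.0], [0.0, 4.0], [1.0, 6.0]]

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # fit and transform in one step

print(X_scaled)
```

Each column of the result runs from 0 to 1, and the return value is a NumPy array even though the input was a list.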
5. You have a dataset with categorical text features and numeric features. Which sequence of steps correctly prepares the data for training a machine learning model?
hard
A. Split data, encode categorical features, normalize numeric features, then clean missing values
B. Clean missing values, encode categorical features, normalize numeric features, then split data
C. Normalize numeric features, clean missing values, split data, then encode categorical features
D. Encode categorical features, split data, clean missing values, then normalize numeric features
Solution
Step 1: Clean missing values first
Cleaning missing data ensures no errors during encoding or normalization.
Step 2: Encode categorical features before normalization
Categorical data must be converted to numbers before normalization.
Step 3: Normalize numeric features and then split data
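Normalization scales numeric data. Among the listed options, only this ordering cleans before encoding and encodes before normalizing. Note that fitting an encoder or scaler on the full dataset before splitting can leak test-set statistics into training, so in practice these steps are usually fitted on the training split only and then applied to the test split.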
Final Answer:
Clean missing values, encode categorical features, normalize numeric features, then split data -> Option B
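A common way to guard against data leakage in practice is to split first and fit the cleaning, encoding, and scaling steps on the training rows only. The sketch below does this with scikit-learn's Pipeline and ColumnTransformer. The column names, values, and the choice of median imputation are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical dataset with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [22.0, 35.0, np.nan, 58.0, 41.0, 29.0],
    "city": ["Paris", "Rome", "Oslo", "Paris", "Rome", "Oslo"],
})

# Split first so the test rows never influence the fitted statistics.
train_df, test_df = train_test_split(df, test_size=2, random_state=0)

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # clean missing values
    ("scale", MinMaxScaler()),                     # normalize
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),  # encode
])

# Fit on the training split only, then apply the same transform to the test split.
X_train = preprocess.fit_transform(train_df)
X_test = preprocess.transform(test_df)

print(X_train.shape, X_test.shape)
```

The cleaning, encoding, and normalizing steps still run in that order; the only difference is that their parameters are learned from the training rows alone.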