Training data preparation in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Why is data normalization important before training a model?

Imagine you have a dataset with features measured in very different units, like height in centimeters and income in dollars. Why should you normalize this data before training a machine learning model?

A. Normalization increases the size of the dataset to improve model accuracy.
B. Normalization removes missing values automatically from the dataset.
C. Normalization ensures all features contribute equally by scaling them to a similar range, preventing bias towards features with larger values.
D. Normalization converts categorical data into numerical labels.
💡 Hint

Think about how different scales can affect the learning process of a model.
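
As a rough illustration of the scale problem described in this question, the sketch below standardizes two features by hand using only the Python standard library. The height and income values are made-up toy numbers, and `standardize` is a hypothetical helper rather than a library function. It applies the usual (x - mean) / std rule per column, which is the same idea scikit-learn's StandardScaler is built on.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Rescale a list of numbers to zero mean and unit variance (z-scores)."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation
    if std == 0:
        # A constant feature has no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Toy values, invented for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0]
incomes_usd = [20000.0, 40000.0, 60000.0, 80000.0]

heights_scaled = standardize(heights_cm)
incomes_scaled = standardize(incomes_usd)

# After scaling, both columns cover the same range despite the raw units.
print([round(v, 3) for v in heights_scaled])
print([round(v, 3) for v in incomes_scaled])
```

Before scaling, the income column is hundreds of times larger than the height column. After scaling, the two printed lists are identical, so neither feature dominates purely because of its unit.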

Predict Output
intermediate
Output of data splitting code

What is the output of the following Python code that splits data into training and testing sets?

from sklearn.model_selection import train_test_split
X = list(range(10))
y = list(range(10))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))
A. 3 7
B. 7 3
C. 10 0
D. 0 10
💡 Hint

Check how the test_size parameter affects the split.
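
To make the mechanics of a shuffled split concrete, here is a minimal standard-library sketch. `simple_train_test_split` is a hypothetical helper written for this illustration. It assumes the test count is the rounded-up fraction of the rows, and it does not reproduce the exact row order that scikit-learn's train_test_split would return for the same seed.

```python
import math
import random

def simple_train_test_split(X, y, test_size=0.3, seed=42):
    """Shuffle row indices, then carve off a test portion.

    Hypothetical helper for illustration only; the shuffled order differs
    from scikit-learn's train_test_split.
    """
    if len(X) != len(y):
        raise ValueError("X and y must have the same length")
    n_samples = len(X)
    n_test = math.ceil(test_size * n_samples)  # test rows, rounded up
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

X = list(range(10))
y = list(range(10))
X_train, X_test, y_train, y_test = simple_train_test_split(X, y, test_size=0.3)
print(len(X_train), len(X_test))
```

Because X and y are indexed with the same shuffled positions, each feature row stays paired with its label, which is the property that matters most when splitting.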

Model Choice
advanced
Best data preparation for text classification

You want to train a model to classify movie reviews as positive or negative. Which data preparation step is most important before training?

A. Convert text to lowercase, remove punctuation, and tokenize words.
B. Normalize numerical features to zero mean and unit variance.
C. Fill missing values with the mean of the column.
D. Encode categorical variables using one-hot encoding.
💡 Hint

Think about what you do to raw text before feeding it to a model.
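
A minimal sketch of basic text cleaning for a review classifier is shown below. `preprocess` is a hypothetical helper and the sample review is invented. It handles ASCII punctuation and whitespace tokenization only, which is a simplification of what real tokenizers do.

```python
import string

def preprocess(text):
    """Lowercase, strip ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

review = "Great movie, LOVED it!"
print(preprocess(review))  # ['great', 'movie', 'loved', 'it']
```

Note that stripping punctuation this way also turns a contraction such as "don't" into "dont", so a production pipeline may want a more careful tokenizer.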

Metrics
advanced
Evaluating data quality impact on model accuracy

You train two models on the same task. Model A uses raw data with many missing values. Model B uses data where missing values were properly handled. Which metric difference would best show the impact of data preparation?

A. Model A has lower accuracy and lower loss than Model B.
B. Model A has higher accuracy but higher loss than Model B.
C. Both models have the same accuracy but different loss values.
D. Model B has higher accuracy and lower loss than Model A.
💡 Hint

Good data preparation usually raises accuracy and lowers loss.
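
For readers who want to see how the two metrics in this question are computed, here is a small sketch of accuracy and binary log loss in plain Python. The labels and predicted probabilities are invented purely to exercise the functions and are not results from any real model. `accuracy` and `log_loss` are hypothetical helpers that mirror the standard definitions.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    correct = sum(1 for t, p in zip(y_true, preds) if t == p)
    return correct / len(y_true)

def log_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Invented labels and probabilities, for illustration only.
y_true = [1, 0, 1, 1]
probs_a = [0.6, 0.6, 0.4, 0.7]  # hypothetical model trained on raw data
probs_b = [0.9, 0.2, 0.8, 0.7]  # hypothetical model trained on cleaned data

print(accuracy(y_true, probs_a), round(log_loss(y_true, probs_a), 3))
print(accuracy(y_true, probs_b), round(log_loss(y_true, probs_b), 3))
```

Accuracy only counts whether each thresholded prediction is right, while log loss also penalizes low confidence in the correct class, which is why the two are usually reported together.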

🔧 Debug
expert
Identify the error in data preprocessing code

What error will this Python code raise when preparing data for training?

import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled[3])
A. IndexError: index 3 is out of bounds for axis 0 with size 3
B. TypeError: 'StandardScaler' object is not callable
C. ValueError: could not convert string to float
D. No error, prints scaled values
💡 Hint

Check the size of the array and the index accessed.
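
The zero-based indexing rule behind this hint can be sketched with a plain list of rows. `safe_row` is a hypothetical helper. A plain Python list is used here instead of a NumPy array, so the exception message differs from NumPy's, but the exception type is the same.

```python
rows = [[1, 2], [3, 4], [5, 6]]  # 3 rows, so valid indices are 0, 1, 2

def safe_row(data, index):
    """Return data[index] if it exists, otherwise None."""
    try:
        return data[index]
    except IndexError:
        return None

print(safe_row(rows, 2))  # [5, 6]
print(safe_row(rows, 3))  # None
```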

Practice

1. What is the main purpose of training data preparation in machine learning?
easy
A. To clean and organize data for better model learning
B. To create the final model architecture
C. To deploy the model to production
D. To write the code for model training

Solution

  1. Step 1: Understand the role of training data preparation

    Training data preparation involves cleaning and organizing data so the model can learn effectively.
  2. Step 2: Differentiate from other steps in machine learning

    Creating model architecture, deployment, and coding are separate steps after data preparation.
  3. Final Answer:

    To clean and organize data for better model learning -> Option A
  4. Quick Check:

    Training data preparation = cleaning and organizing data [OK]
Hint: Focus on data cleaning and organizing for training
Common Mistakes:
  • Confusing data preparation with model building
  • Thinking deployment is part of data preparation
  • Assuming coding is data preparation
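
As one possible illustration of "cleaning and organizing", the sketch below drops rows with missing values and exact duplicates. `clean_rows` is a hypothetical helper and the rows are toy values. Real projects may prefer imputing missing values over dropping rows.

```python
def clean_rows(rows):
    """Drop rows containing None and remove exact duplicates, keeping order."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue  # skip rows with missing values
        key = tuple(row)
        if key in seen:
            continue  # skip rows already kept
        seen.add(key)
        cleaned.append(row)
    return cleaned

# Toy rows of (height_cm, income_usd), invented for illustration.
raw = [[170, 52000], [None, 48000], [170, 52000], [165, None], [180, 61000]]
print(clean_rows(raw))  # [[170, 52000], [180, 61000]]
```
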
2. Which of the following is the correct way to split data into training and testing sets in Python using scikit-learn?
easy
A. split_train_test(data, 0.2)
B. train_test(data, split=0.2)
C. train_test_split(data, test_size=0.2)
D. test_train_split(data, size=0.2)

Solution

  1. Step 1: Recall the scikit-learn function for splitting data

    The correct function is train_test_split with parameters like test_size.
  2. Step 2: Check the syntax of each option

    Only train_test_split(data, test_size=0.2) uses the correct function name and parameter syntax.
  3. Final Answer:

    train_test_split(data, test_size=0.2) -> Option C
  4. Quick Check:

    Correct function and parameter = train_test_split(data, test_size=0.2) [OK]
Hint: Remember scikit-learn's train_test_split function name
Common Mistakes:
  • Using wrong function names
  • Incorrect parameter names
  • Mixing order of parameters
3. Given the code below, what will be the output of print(X_train.shape, X_test.shape)?
from sklearn.model_selection import train_test_split
import numpy as np
X = np.arange(20).reshape(10, 2)
X_train, X_test = train_test_split(X, test_size=0.3, random_state=42)
medium
A. (7, 2) (3, 2)
B. (3, 2) (7, 2)
C. (10, 2) (0, 2)
D. (5, 2) (5, 2)

Solution

  1. Step 1: Understand the data shape and split ratio

    The data X has 10 rows and 2 columns. test_size=0.3 means 30% of the rows go to testing (3 rows) and the remaining 70% go to training (7 rows). The split is row-wise, so the number of columns stays at 2.
  2. Step 2: Calculate the shapes of training and testing sets

    Training set shape: (7, 2), Testing set shape: (3, 2).
  3. Final Answer:

    (7, 2) (3, 2) -> Option A
  4. Quick Check:

    70% train = 7 rows, 30% test = 3 rows [OK]
Hint: Calculate rows by multiplying total by split ratio
Common Mistakes:
  • Swapping train and test sizes
  • Ignoring the shape's second dimension
  • Misunderstanding test_size meaning
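
The row arithmetic in the solution above can be sketched as a small function. `split_shapes` is a hypothetical helper. It assumes the test count is rounded up when the fraction does not divide evenly, which is how scikit-learn is generally described as handling a fractional test_size. Treat that rounding rule as an assumption and check the library documentation for edge cases.

```python
import math

def split_shapes(n_rows, n_cols, test_size):
    """Return (train_shape, test_shape) for a fractional test_size.

    Assumes the test row count is rounded up and the remaining rows go to
    training; a row-wise split leaves the column count unchanged.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = n_rows - n_test
    return (n_train, n_cols), (n_test, n_cols)

print(split_shapes(10, 2, 0.3))  # ((7, 2), (3, 2))
print(split_shapes(11, 2, 0.3))  # ((7, 2), (4, 2))
```
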
4. Identify the error in the following code snippet for normalizing data using MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
X_scaled = scaler.fit_transform(X)
print(X_scaled)
medium
A. MinMaxScaler cannot handle negative values
B. MinMaxScaler requires data as a numpy array, not list
C. fit_transform should be replaced with scaler.fit(X).transform(X)
D. No error, code runs correctly

Solution

  1. Step 1: Check input data type compatibility

    MinMaxScaler accepts lists or numpy arrays as input, so list input is valid.
  2. Step 2: Verify method usage

    Calling scaler.fit_transform(X) is the correct way to fit and transform data in one step.
  3. Final Answer:

    No error, code runs correctly -> Option D
  4. Quick Check:

    MinMaxScaler works with lists and fit_transform method [OK]
Hint: MinMaxScaler accepts lists and arrays directly
Common Mistakes:
  • Thinking input must be numpy array
  • Misusing fit and transform methods
  • Assuming scaler rejects negative values
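
To show why negative inputs are not a problem for min-max scaling, here is a hand-rolled sketch of the (x - min) / (max - min) rule applied per column to the same X as in the question. `min_max_scale` is a hypothetical helper, not the scikit-learn implementation.

```python
def min_max_scale(rows):
    """Scale each column of a list-of-rows table to the [0, 1] range."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled_row = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # A constant column has zero span; map it to 0.0.
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled

X = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
print(min_max_scale(X))
# [[0.0, 0.0], [0.25, 0.25], [0.5, 0.5], [1.0, 1.0]]
```

The column minimum is subtracted before dividing, so a negative minimum simply maps to 0.0 like any other minimum.
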
5. You have a dataset with categorical text features and numeric features. Which sequence of steps correctly prepares the data for training a machine learning model?
hard
A. Split data, encode categorical features, normalize numeric features, then clean missing values
B. Clean missing values, encode categorical features, normalize numeric features, then split data
C. Normalize numeric features, clean missing values, split data, then encode categorical features
D. Encode categorical features, split data, clean missing values, then normalize numeric features

Solution

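  1. Step 1: Clean missing values first

    Cleaning missing data ensures no errors during encoding or normalization.
  2. Step 2: Encode categorical features before normalization

    Categorical data must be converted to numbers before a model can use it, so encoding comes before the numeric scaling step in this ordering.
  3. Step 3: Normalize numeric features and then split data

    Normalization scales numeric data. Of the four options, only this ordering cleans first and encodes before normalizing. Note that splitting last does not by itself prevent data leakage: means, minimums, and maximums computed on the full dataset include the test rows. In practice, scalers and encoders should be fitted on the training split only and then applied to the test split.
  4. Final Answer:

    Clean missing values, encode categorical features, normalize numeric features, then split data -> Option B
  5. Quick Check:

    Order among the listed options: clean -> encode -> normalize -> split [OK]
Hint: Clean first and encode before normalizing; fit scalers on training data only
Common Mistakes:
  • Fitting scalers or encoders on the full dataset, which leaks test information into training
  • Normalizing before encoding categorical data
  • Fitting encoders separately on the training and test sets, which leads to inconsistent categories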
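
One common way to limit leakage when scaling is to compute the scaling statistics from the training rows only and reuse them on the test rows. The sketch below shows that pattern with the standard library. `fit_scaler` and `apply_scaler` are hypothetical helpers, and the numbers are toy values.

```python
from statistics import fmean, pstdev

def fit_scaler(train_values):
    """Learn the mean and standard deviation from training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values) or 1.0  # avoid dividing by zero
    return mean, std

def apply_scaler(values, mean, std):
    """Scale any split using statistics learned from the training split."""
    return [(v - mean) / std for v in values]

# Toy values, invented for illustration only.
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mean, std = fit_scaler(train)            # the test rows are never seen here
train_scaled = apply_scaler(train, mean, std)
test_scaled = apply_scaler(test, mean, std)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in test_scaled])
```

The test value 50.0 lies outside the training range and therefore scales to a value larger than any training point, which is expected: the test set is treated as unseen data.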