Imagine you have a dataset with features measured in very different units, like height in centimeters and income in dollars. Why should you normalize this data before training a machine learning model?
Think about how different scales can affect the learning process of a model.
Normalization scales features to a similar range so that no single feature dominates the learning process due to its scale. This helps models learn better and faster.
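As a quick illustration, min-max normalization (one common choice) can be sketched with NumPy; the height and income values here are made-up examples:

```python
import numpy as np

# Height in cm and income in dollars: wildly different scales.
X = np.array([[170.0, 40_000.0],
              [160.0, 90_000.0],
              [180.0, 65_000.0]])

# Min-max normalization: rescale each feature (column) to [0, 1].
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)
```

After scaling, both columns lie in [0, 1], so neither feature dominates purely because of its units.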
What is the output of the following Python code that splits data into training and testing sets?
from sklearn.model_selection import train_test_split

X = list(range(10))
y = list(range(10))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(len(X_train), len(X_test))
Check how the test_size parameter affects the split.
With test_size=0.3, 30% of the 10 items go to the test set, so the output is 7 3: 7 items for training and 3 for testing.
You want to train a model to classify movie reviews as positive or negative. Which data preparation step is most important before training?
Think about what you do to raw text before feeding it to a model.
Text data needs cleaning like lowercasing, removing punctuation, and tokenizing to convert it into a form the model can understand.
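A minimal cleaning pipeline for those three steps might look like this (the function name and sample review are illustrative):

```python
import string

def clean_text(review: str) -> list[str]:
    # Lowercase, strip punctuation, then tokenize on whitespace.
    lowered = review.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()

print(clean_text("Great movie, loved it!"))
```

The resulting token list is what you would then map to numeric IDs or vectors for the model.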
You train two models on the same task. Model A uses raw data with many missing values. Model B uses data where missing values were properly handled. Which metric difference would best show the impact of data preparation?
Think about which evaluation metrics would reflect the difference in data quality between the two models.
Handling missing values properly improves data quality, leading to better model accuracy and lower loss.
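One simple strategy Model B might use is mean imputation, sketched here with NumPy (the feature values are made up for illustration):

```python
import numpy as np

# One feature column with missing values encoded as NaN.
X = np.array([1.0, 2.0, np.nan, 4.0, np.nan, 5.0])

# Mean imputation: replace each NaN with the mean of the observed values.
mean = np.nanmean(X)
X_imputed = np.where(np.isnan(X), mean, X)
print(X_imputed)
```

More sophisticated options (median imputation, model-based imputation) exist, but even this baseline avoids the errors or biased estimates that raw NaNs can cause during training.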
What error will this Python code raise when preparing data for training?
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1, 2], [3, 4], [5, 6]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled[3])
Check the size of the array and the index accessed.
X_scaled has only 3 rows (indices 0, 1, 2), so accessing index 3 raises an IndexError.