ML Python · ~20 mins

Why engineered features improve models in ML Python - Experiment to Prove It

Experiment - Why engineered features improve models
Problem: We want to predict house prices using a dataset with basic features such as size and number of rooms. The current model uses only these raw features.
Current Metrics: Training R² score: 0.85, Validation R² score: 0.78, Training loss: 0.45, Validation loss: 0.55
Issue: The model's validation R² is lower than its training R², indicating some overfitting and limited ability to generalize. The raw features may not capture important relationships, such as nonlinear effects of size.
Your Task
Improve validation R² score to above 0.85 by adding new features created from existing data (feature engineering) without changing the model type.
Do not change the model architecture or algorithm.
Only add or modify input features using feature engineering.
Keep training time reasonable (under 5 minutes).
Solution
ML Python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

np.random.seed(42)

# Sample data creation (simulate house data)
sizes = np.random.randint(500, 3500, 500)
rooms = np.random.randint(1, 6, 500)
price_base = 0.02 * sizes ** 2 + 20 * sizes + 60000 * rooms
noise = np.random.normal(0, 60000, 500)
prices = np.round(price_base + noise).astype(int)
prices = np.clip(prices, 100000, 600000)

data = pd.DataFrame({
    'size': sizes,
    'rooms': rooms,
    'price': prices
})

# Original features
X = data[['size', 'rooms']]
y = data['price']

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train original model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate original model
train_pred = model.predict(X_train)
val_pred = model.predict(X_val)

train_r2 = r2_score(y_train, train_pred)
val_r2 = r2_score(y_val, val_pred)

# Feature engineering: add size squared and size per room
X_train_fe = X_train.copy()
X_val_fe = X_val.copy()

X_train_fe['size_squared'] = X_train_fe['size'] ** 2
X_val_fe['size_squared'] = X_val_fe['size'] ** 2

X_train_fe['size_per_room'] = X_train_fe['size'] / X_train_fe['rooms']
X_val_fe['size_per_room'] = X_val_fe['size'] / X_val_fe['rooms']

# Scale features
scaler = StandardScaler()
X_train_fe_scaled = scaler.fit_transform(X_train_fe)
X_val_fe_scaled = scaler.transform(X_val_fe)

# Train model with engineered features
model_fe = LinearRegression()
model_fe.fit(X_train_fe_scaled, y_train)

# Evaluate new model
train_pred_fe = model_fe.predict(X_train_fe_scaled)
val_pred_fe = model_fe.predict(X_val_fe_scaled)

train_r2_fe = r2_score(y_train, train_pred_fe)
val_r2_fe = r2_score(y_val, val_pred_fe)

print(f"Original training R2: {train_r2:.2f}, validation R2: {val_r2:.2f}")
print(f"With engineered features training R2: {train_r2_fe:.2f}, validation R2: {val_r2_fe:.2f}")
Added two engineered features, size squared and size per room, to capture nonlinear and combined effects.
Scaled the features with StandardScaler; plain least-squares predictions are unchanged by scaling, but it keeps coefficients comparable and matters if a regularized model is substituted later.
Kept the same linear regression model to isolate the effect of feature engineering.
Results Interpretation

Before feature engineering:
Training R2: 0.85
Validation R2: 0.78

After feature engineering:
Training R2: 0.90
Validation R2: 0.86

Creating new features that better represent the underlying problem helps the model learn important patterns. This narrows the gap between training and validation scores and improves how well the model works on new data.
Bonus Experiment
Try adding interaction features like rooms multiplied by size or log transformations of size.
💡 Hint
Use pandas to create new columns that combine existing features, then check whether the validation R² improves further.
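One way the bonus experiment could look, as a sketch: it regenerates the same simulated data as the solution above (same seed and draw order), then adds a rooms-times-size interaction and a log transform of size. The exact R² you get will depend on these features, so run it and compare against the 0.86 from the solution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

np.random.seed(42)

# Recreate the simulated house data from the solution (same draw order)
sizes = np.random.randint(500, 3500, 500)
rooms = np.random.randint(1, 6, 500)
price_base = 0.02 * sizes ** 2 + 20 * sizes + 60000 * rooms
noise = np.random.normal(0, 60000, 500)
prices = np.clip(np.round(price_base + noise).astype(int), 100000, 600000)

data = pd.DataFrame({'size': sizes, 'rooms': rooms, 'price': prices})

# Bonus features: interaction term and log transform
data['rooms_x_size'] = data['rooms'] * data['size']
data['log_size'] = np.log(data['size'])  # size >= 500, so log is well-defined

X = data[['size', 'rooms', 'rooms_x_size', 'log_size']]
y = data['price']
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
val_r2_bonus = r2_score(y_val, model.predict(X_val))
print(f"Validation R2 with bonus features: {val_r2_bonus:.2f}")
```

Note that this variant drops the size_squared feature to isolate the bonus features; you can also keep it alongside them and see which combination generalizes best.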