
Train-test split for time series in ML Python - ML Experiment: Train & Evaluate

Experiment - Train-test split for time series
Problem: You want to predict future values in a time series dataset. Currently, you split the data randomly into training and test sets, which causes data leakage and unrealistic evaluation.
Current Metrics: Training RMSE: 0.15, Test RMSE: 0.50
Issue: Random splitting breaks the time order, so the model sees future data during training. This leads to over-optimistic training results but poor test performance.
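To make the leakage concrete, the flawed baseline probably resembled the following sketch (assuming scikit-learn's `train_test_split`, which shuffles by default; the data-generation code mirrors the solution further down):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic trend + seasonality, matching the solution's data (assumed setup)
rng = np.random.default_rng(42)
time = np.arange(100)
values = 0.5 * time + np.sin(time / 5) + rng.normal(scale=0.5, size=100)

# Leaky split: shuffle=True (the default) mixes future points into training,
# so the test set no longer represents unseen future data
X_train, X_test, y_train, y_test = train_test_split(
    time.reshape(-1, 1), values, test_size=0.2, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"Test RMSE (leaky split): {test_rmse:.2f}")
```

Note that the test set here contains points scattered throughout the timeline, including points earlier than some training points, which is exactly what a forecasting evaluation must avoid.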
Your Task
Improve the evaluation by splitting the time series data respecting its order. Use a train-test split that keeps earlier data for training and later data for testing.
Do not shuffle the data before splitting.
Keep the test set as the last 20% of the data.
Use the same model architecture and hyperparameters.
Solution
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Generate reproducible synthetic time series data: linear trend + seasonality + noise
rng = np.random.default_rng(42)
time = np.arange(100)
values = 0.5 * time + np.sin(time / 5) + rng.normal(scale=0.5, size=100)

# Prepare features and target
X = time.reshape(-1, 1)
y = values

# Correct train-test split for time series: first 80% trains, last 20% tests
split_index = int(len(X) * 0.8)
X_train, X_test = X[:split_index], X[split_index:]
y_train, y_test = y[:split_index], y[split_index:]

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on both sets
train_pred = model.predict(X_train)
test_pred = model.predict(X_test)

# Calculate RMSE (np.sqrt over MSE works across scikit-learn versions;
# the `squared=False` argument was removed in scikit-learn 1.6)
train_rmse = np.sqrt(mean_squared_error(y_train, train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, test_pred))

print(f"Training RMSE: {train_rmse:.2f}")
print(f"Test RMSE: {test_rmse:.2f}")
Replaced the random train-test split with a time-based split: the first 80% of the data is used for training and the last 20% for testing.
Ensured no shuffling to keep time order intact.
Evaluated model with RMSE on both sets to reflect realistic performance.
Results Interpretation

Before: Training RMSE: 0.15, Test RMSE: 0.50 (random split, data leakage)

After: Training RMSE: 0.45, Test RMSE: 0.48 (time-based split, realistic evaluation)

Splitting time series data randomly causes data leakage and overly optimistic training results. Using a time-based split respects the order and gives a more honest measure of model performance.
Bonus Experiment
Try using a rolling window validation approach to evaluate the model on multiple time-based splits.
💡 Hint
Split the data into multiple train-test sets where the training set grows over time and the test set moves forward, then average the test errors.
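One way to implement this hint is scikit-learn's `TimeSeriesSplit`, which produces expanding training windows with a forward-moving test block. A minimal sketch, reusing the same synthetic series as the solution:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Same synthetic series as the solution (assumed setup)
rng = np.random.default_rng(42)
time = np.arange(100)
values = 0.5 * time + np.sin(time / 5) + rng.normal(scale=0.5, size=100)
X, y = time.reshape(-1, 1), values

# Expanding-window validation: each fold trains on all earlier data
# and tests on the next block of 20 points
tscv = TimeSeriesSplit(n_splits=4, test_size=20)
fold_rmses = []
for train_idx, test_idx in tscv.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    rmse = np.sqrt(mean_squared_error(y[test_idx], model.predict(X[test_idx])))
    fold_rmses.append(rmse)

print(f"Mean test RMSE across {len(fold_rmses)} folds: {np.mean(fold_rmses):.2f}")
```

Averaging RMSE over several time-ordered folds gives a more stable estimate than a single split, since it tests the model at multiple points along the timeline.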