Introduction
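We split time series data into training and testing parts in chronological order to check whether our model can predict future values well, without cheating by looking ahead. The model trains only on earlier observations and is tested on later ones.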
train_size = int(len(data) * 0.8)
train = data[:train_size]
test = data[train_size:]
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train_size = int(len(data) * 0.7)
train = data[:train_size]
test = data[train_size:]
import pandas as pd

series = pd.Series(range(100))
train = series[:80]
test = series[80:]
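The three snippets above repeat the same pattern: compute a split index, then slice. A minimal sketch of that pattern as a reusable function is shown here. The name `chronological_split` and its `train_fraction` parameter are hypothetical, introduced only for illustration.

```python
def chronological_split(data, train_fraction=0.8):
    """Split an ordered sequence into (train, test) without shuffling.

    Hypothetical helper: the earliest observations go to train,
    the latest ones go to test.
    """
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    split_index = int(len(data) * train_fraction)
    return data[:split_index], data[split_index:]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
train, test = chronological_split(data, train_fraction=0.7)
print(train)  # [1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9, 10]
```

Because the function only slices, the order of the observations is never changed.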
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Create a simple time series: y = 2*x + noise
np.random.seed(0)
x = np.arange(50).reshape(-1, 1)
y = 2 * x.flatten() + np.random.normal(0, 5, 50)

# Split data: first 40 for training, last 10 for testing
train_size = 40
x_train, y_train = x[:train_size], y[:train_size]
x_test, y_test = x[train_size:], y[train_size:]

# Train linear regression model
model = LinearRegression()
model.fit(x_train, y_train)

# Predict on test data
y_pred = model.predict(x_test)

# Calculate mean squared error
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on test data: {mse:.2f}")
print(f"Predictions: {y_pred.round(2)}")
data into 80% train and 20% test sets while preserving order?
test if data has 1000 samples?
split_index = int(len(data) * 0.75)
train = data[:split_index]
test = data[split_index:]
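A quick way to see how many observations a 0.75 split leaves on each side is to run it on a placeholder series. The sketch below assumes `data` is a plain list of 1000 integers, which is only a stand-in for a real series.

```python
# Placeholder series of 1000 ordered observations (an assumption for this sketch)
data = list(range(1000))

split_index = int(len(data) * 0.75)
train = data[:split_index]
test = data[split_index:]

print(len(train), len(test))  # 750 250

# Every training observation comes before every test observation
print(max(train) < min(test))  # True
```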
data:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2)

What is the main problem with this approach?
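By default, `train_test_split` shuffles the data before splitting, so later observations can end up in the training set. One possible fix, sketched here on a placeholder list of 10 values, is to pass `shuffle=False` so the split keeps the original order.

```python
from sklearn.model_selection import train_test_split

# Placeholder ordered series (an assumption for this sketch)
data = list(range(10))

# Default behaviour: rows are shuffled, so train and test mix early and late values
train_shuffled, test_shuffled = train_test_split(data, test_size=0.2, random_state=0)

# shuffle=False keeps the order: train is the first 80%, test is the last 20%
train, test = train_test_split(data, test_size=0.2, shuffle=False)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```

With `shuffle=False` the result matches the manual slicing shown earlier in this section.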