In time series data, why should we avoid randomly splitting data into training and testing sets?
Think about how time flows and what it means to predict future values.
Random splitting mixes past and future data, letting the model see future information during training, which is unrealistic and causes data leakage.
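A minimal sketch of the leakage problem: shuffling ten time-ordered points before splitting lets observations from late in the series land in the training set while earlier points end up in the test set (the indices shown are illustrative, not a prescribed dataset).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten observations where the index doubles as the time step
data = np.arange(10)

# Random (shuffled) split: "future" points can leak into training
train, test = train_test_split(data, test_size=0.3, random_state=42)
print(sorted(train), sorted(test))
```

Because the split shuffles first, the training set typically contains timestamps later than some test timestamps, which is exactly the leakage the answer describes.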
What is the length of the training and testing sets after this split?
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)
train, test = train_test_split(data, test_size=0.3, random_state=42)
print(len(train), len(test))
Check how train_test_split divides data by default.
By default, train_test_split shuffles the data before splitting; with test_size=0.3 on 10 items, the training set has 7 items and the test set has 3.
Which method correctly splits time series data to avoid data leakage and respect temporal order?
Think about keeping the time order intact.
Using the first 80% of the series for training and the last 20% for testing preserves temporal order and prevents future information from leaking into training.
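The sequential 80/20 split above can be sketched as a simple index cut (the 100-point series here is a stand-in for any time-ordered data):

```python
import numpy as np

data = np.arange(100)  # pretend this is a time-ordered series

# Sequential split: everything before the cut is "past", everything after is "future"
split = int(len(data) * 0.8)
train, test = data[:split], data[split:]
print(len(train), len(test))  # 80 20
```

Every training timestamp precedes every test timestamp, so the model never sees future values during training.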
What is a key consideration when choosing the test size for a time series train-test split?
Think about repeating patterns in time series.
Choosing a test size that covers at least one full seasonal cycle ensures the model is evaluated on every seasonal pattern, not just part of the cycle.
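As an illustration of sizing the test set to a seasonal cycle, assume a hypothetical five years of monthly data with yearly seasonality (period 12); holding out at least 12 points guarantees every seasonal position appears in the test set:

```python
import numpy as np

# Hypothetical: 5 years of monthly observations, yearly seasonality (period = 12)
data = np.arange(60)
season_length = 12

# Hold out at least one full cycle as the test set
test_size = season_length
train, test = data[:-test_size], data[-test_size:]
print(len(train), len(test))  # 48 12
```

The period (12 here) is an assumption for the example; the same idea applies to weekly, daily, or other cycles.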
What error will this code raise when splitting time series data?
import numpy as np
from sklearn.model_selection import train_test_split

data = np.arange(10)
train, test = train_test_split(data, test_size=0.3, shuffle=False)
print(train)
print(test)
Check sklearn docs for shuffle parameter behavior.
No error: train_test_split accepts shuffle=False and splits the data sequentially, so train is [0 1 2 3 4 5 6] and test is [7 8 9] — this is in fact the appropriate way to use it for time series.