Python sklearn Program to Predict House Price
This program uses LinearRegression from sklearn.linear_model to train a model on house features and prices: fit the model with model.fit(X_train, y_train), then predict prices for unseen houses with model.predict(X_test).
Code
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: [size in sqft, number of rooms]
X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
# Prices in $
y = np.array([240000, 310000, 180000, 450000, 360000])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print('Predicted prices:', predictions)
print('Actual prices:', y_test)
```
Dry Run
Let's trace the sample data through the code to see how the model learns and predicts house prices.
Prepare data
X = [[1200,3],[1500,4],[800,2],[2000,5],[1700,4]], y = [240000,310000,180000,450000,360000]
Split data
Training data: 4 samples, Test data: 1 sample (e.g., X_test=[[1700,4]], y_test=[360000])
Train model
Model learns coefficients to fit price = a*size + b*rooms + c
Predict
Model predicts price for X_test: [1700,4] -> ~360000
| Step | Action | Values |
|---|---|---|
| 1 | Data prepared | [[1200,3],[1500,4],[800,2],[2000,5],[1700,4]], [240000,310000,180000,450000,360000] |
| 2 | Split data | Train: 4 samples, Test: 1 sample |
| 3 | Train model | Fit coefficients for features |
| 4 | Predict | Input: [1700,4], Output: ~360000 |
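The "Train model" step can be made concrete by printing the coefficients the model actually learns. This sketch reuses the sample data above; the exact coefficient values depend on which rows land in the training split, so they are not hard-coded here:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# The fitted model is: price = a*size + b*rooms + c
a, b = model.coef_        # one learned coefficient per feature
c = model.intercept_      # learned bias term
print(f"price = {a:.2f}*size + {b:.2f}*rooms + {c:.2f}")
```

Inspecting coef_ and intercept_ this way is a quick sanity check that the model learned a sensible relationship (e.g., price rising with size).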
Why This Works
Step 1: Data preparation
We organize house features and prices into arrays so the model can learn patterns.
Step 2: Training the model
With fit(), the linear regression model finds the coefficients (a line for one feature, a hyperplane for several) that minimize the squared error between predicted and actual prices.
Step 3: Making predictions
After training, the model uses learned coefficients to predict prices for new houses with predict().
Alternative Approaches
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('Predicted prices:', predictions)
print('Actual prices:', y_test)
```
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('Predicted prices:', predictions)
print('Actual prices:', y_test)
```
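To compare the approaches on equal footing, all three regressors can be fitted on the same split and scored with mean_absolute_error. With only one test sample this is just the absolute error on that house, so treat the numbers as illustrative rather than a real benchmark:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=10, random_state=42),
}
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {errors[name]:,.0f}")
```

On a dataset this small the tree-based models mostly memorize the training rows; differences between the approaches only become meaningful with more data.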
Complexity: O(n * m) time, O(n * m) space, where n is the number of samples and m the number of features
Time Complexity
Training linear regression takes time proportional to the number of samples (n) times the number of features (m). Prediction is much faster: just a dot product of the learned coefficients with the feature vector, O(m) per sample.
Space Complexity
The fitted model stores one coefficient per feature plus an intercept, O(m); during training, the sample matrix itself occupies O(n * m) space.
Which Approach is Fastest?
Linear regression is the fastest and simplest; decision trees and random forests are slower to train but can capture non-linear patterns that a straight-line fit misses.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Linear Regression | O(n*m) | O(m) | Simple, fast, linear relationships |
| Decision Tree | O(n*m*log n) | O(n) | Non-linear data, interpretability |
| Random Forest | O(t*n*m*log n) | O(t*n) | Higher accuracy on complex data, at extra cost |