
Python sklearn Program to Predict House Price

Use LinearRegression from sklearn.linear_model to train a model on house features and prices: fit it with model.fit(X_train, y_train), then predict prices for new houses with model.predict().
📋

Examples

Input: [[1200, 3], [1500, 4], [800, 2]]
Output: [240000, 310000, 180000]

Input: [[1000, 2], [2000, 5], [1500, 3]]
Output: [200000, 450000, 320000]

Input: [[0, 0], [500, 1], [3000, 6]]
Output: [0, 100000, 600000]
🧠

How to Think About It

First, collect house data with features such as size and number of rooms, along with the corresponding prices. Then split the data into training and testing sets. Train a simple linear regression model on the training data to learn the relationship between features and prices. Finally, use the trained model to predict prices for new houses.
📐

Algorithm

1. Collect house features and prices data.
2. Split the data into training and testing sets.
3. Create a linear regression model.
4. Train the model on the training data.
5. Predict house prices with the trained model on the test data.
6. Evaluate the model's accuracy.
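Step 6 (evaluation) is not shown in the Code section below. A minimal sketch of the metric calls using sklearn.metrics; here the model is scored on its own training data purely to illustrate the API (in practice, score on the held-out test set):

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np

# Same sample data as in the Code section
X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])

model = LinearRegression()
model.fit(X, y)  # fit on all five samples just to demonstrate the metrics

predictions = model.predict(X)
print('MAE:', mean_absolute_error(y, predictions))   # average error in $
print('R^2:', r2_score(y, predictions))              # 1.0 = perfect fit
```

Mean absolute error reports how far off predictions are in dollars, which is easier to interpret than R^2 for beginners.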
💻

Code

sklearn
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: [size in sqft, number of rooms]
X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
# Prices in $
y = np.array([240000, 310000, 180000, 450000, 360000])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print('Predicted prices:', predictions)
print('Actual prices:', y_test)
🔍

Dry Run

Let's trace the sample data through the code to see how the model learns and predicts house prices.

1. Prepare data: X = [[1200,3],[1500,4],[800,2],[2000,5],[1700,4]], y = [240000,310000,180000,450000,360000]

2. Split data: Training data: 4 samples, Test data: 1 sample (e.g., X_test=[[1700,4]], y_test=[360000])

3. Train model: Model learns coefficients to fit price = a*size + b*rooms + c

4. Predict: Model predicts price for X_test: [1700,4] -> ~360000

Step | Action        | Values
-----|---------------|-------
1    | Data prepared | X = [[1200,3],[1500,4],[800,2],[2000,5],[1700,4]], y = [240000,310000,180000,450000,360000]
2    | Split data    | Train: 4 samples, Test: 1 sample
3    | Train model   | Fit coefficients for features
4    | Predict       | Input: [1700,4], Output: ~360000
💡

Why This Works

Step 1: Data preparation

We organize house features and prices into arrays so the model can learn patterns.

Step 2: Training the model

When fit() is called, the linear regression model finds the coefficients of the best-fitting linear function (with two features this is a plane rather than a line) by minimizing the squared error between predicted and actual prices.

Step 3: Making predictions

After training, the model uses learned coefficients to predict prices for new houses with predict().
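The learned coefficients and intercept can be inspected directly on the fitted model. A small sketch using the sample data from the Code section:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])

model = LinearRegression().fit(X, y)

# Learned model: price ~= coef_[0]*size + coef_[1]*rooms + intercept_
print('Coefficient per sqft:', model.coef_[0])
print('Coefficient per room:', model.coef_[1])
print('Intercept:', model.intercept_)
```

Reading coef_ and intercept_ shows exactly what predict() computes: a dot product of the features with the coefficients, plus the intercept.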

🔄

Alternative Approaches

Decision Tree Regressor
sklearn
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import numpy as np
X = np.array([[1200,3],[1500,4],[800,2],[2000,5],[1700,4]])
y = np.array([240000,310000,180000,450000,360000])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeRegressor()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print('Predicted prices:', predictions)
print('Actual prices:', y_test)
Decision trees can capture non-linear relationships but may overfit small data.
Random Forest Regressor
sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import numpy as np
X = np.array([[1200,3],[1500,4],[800,2],[2000,5],[1700,4]])
y = np.array([240000,310000,180000,450000,360000])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print('Predicted prices:', predictions)
print('Actual prices:', y_test)
Random forests reduce overfitting and improve accuracy but need more computation.
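With only five samples, a single train/test split gives a noisy accuracy estimate, whichever model is used. Cross-validation averages over several splits; a sketch with cross_val_score (mean absolute error is used as the score because R^2 is undefined on a single-sample fold):

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])

# cv=5 on five samples: each fold holds out exactly one house
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring='neg_mean_absolute_error')
print('MAE per fold:', -scores)   # scores are negated errors, so flip the sign
print('Mean MAE:', -scores.mean())
```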

Complexity: O(n * m) time, O(n * m) space

Time Complexity

Training linear regression takes time roughly proportional to the number of samples (n) times the number of features (m). Prediction is much faster: just one dot product per sample.

Space Complexity

The trained model stores one coefficient per feature plus an intercept, so the model itself needs only O(m) space; the training data held in memory during fitting takes O(n * m).

Which Approach is Fastest?

Linear regression is fastest and simplest; decision trees and random forests are slower but can model complex patterns better.

Approach          | Time           | Space  | Best For
------------------|----------------|--------|---------
Linear Regression | O(n*m)         | O(m)   | Simple, fast, linear relationships
Decision Tree     | O(n*m*log n)   | O(n)   | Non-linear data, interpretability
Random Forest     | O(t*n*m*log n) | O(t*n) | Better accuracy on complex data, slower
💡
Always split your data into training and testing sets to check how well your model predicts new data.
⚠️
Beginners often forget to reshape input data or split data properly, causing errors or poor predictions.
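To illustrate the reshape pitfall: predict() expects a 2-D array even for a single house, so a 1-D feature vector must be reshaped first. A minimal sketch with the sample data:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])
model = LinearRegression().fit(X, y)

new_house = np.array([1300, 3])   # shape (2,): 1-D, predict() would raise ValueError
prediction = model.predict(new_house.reshape(1, -1))  # shape (1, 2): works
print('Predicted price:', prediction[0])
```

reshape(1, -1) turns the vector into one row with as many columns as needed, which is exactly the "one sample, m features" layout sklearn requires.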