Python sklearn Program to Predict House Price
This program uses LinearRegression from sklearn.linear_model to train a model on house features and prices: fit the model with model.fit(X_train, y_train), then predict prices for unseen houses with model.predict(X_test).
Code
```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data: [size in sqft, number of rooms]
X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
# Prices in $
y = np.array([240000, 310000, 180000, 450000, 360000])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print('Predicted prices:', predictions)
print('Actual prices:', y_test)
```
Dry Run
Let's trace the sample data through the code to see how the model learns and predicts house prices.
Prepare data
X = [[1200,3],[1500,4],[800,2],[2000,5],[1700,4]], y = [240000,310000,180000,450000,360000]
Split data
Training data: 4 samples, Test data: 1 sample (e.g., X_test=[[1700,4]], y_test=[360000])
Train model
Model learns coefficients to fit price = a*size + b*rooms + c
Predict
Model predicts price for X_test: [1700,4] -> ~360000
| Step | Action | Values |
|---|---|---|
| 1 | Data prepared | [[1200,3],[1500,4],[800,2],[2000,5],[1700,4]], [240000,310000,180000,450000,360000] |
| 2 | Split data | Train: 4 samples, Test: 1 sample |
| 3 | Train model | Fit coefficients for features |
| 4 | Predict | Input: [1700,4], Output: ~360000 |
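The "Train model" step can be made concrete by printing the coefficients the model actually learns. This sketch reuses the sample data above; the exact coefficient values depend on which rows land in the training split, so they are not hard-coded here:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# The fitted model is: price = a*size + b*rooms + c
a, b = model.coef_        # one learned coefficient per feature
c = model.intercept_      # learned bias term
print(f"price = {a:.2f}*size + {b:.2f}*rooms + {c:.2f}")
```

Inspecting coef_ and intercept_ this way is a quick sanity check that the model learned a sensible relationship (e.g., price rising with size).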
Why This Works
Step 1: Data preparation
We organize house features and prices into arrays so the model can learn patterns.
Step 2: Training the model
With fit(), the linear regression model finds the coefficients (a line for one feature, a hyperplane for several) that minimize the squared error between predicted and actual prices.
Step 3: Making predictions
After training, the model uses learned coefficients to predict prices for new houses with predict().
Alternative Approaches
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor()
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('Predicted prices:', predictions)
print('Actual prices:', y_test)
```
```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=10, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('Predicted prices:', predictions)
print('Actual prices:', y_test)
```
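To compare the approaches on equal footing, all three regressors can be fitted on the same split and scored with mean_absolute_error. With only one test sample this is just the absolute error on that house, so treat the numbers as illustrative rather than a real benchmark:

```python
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
import numpy as np

X = np.array([[1200, 3], [1500, 4], [800, 2], [2000, 5], [1700, 4]])
y = np.array([240000, 310000, 180000, 450000, 360000])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=10, random_state=42),
}
errors = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    errors[name] = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {errors[name]:,.0f}")
```

On a dataset this small the tree-based models mostly memorize the training rows; differences between the approaches only become meaningful with more data.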
Complexity: O(n * m) time, O(n * m) space, where n is the number of samples and m the number of features
Time Complexity
Training linear regression takes time proportional to the number of samples (n) times the number of features (m). Prediction is much faster: just a dot product of the learned coefficients with the feature vector, O(m) per sample.
Space Complexity
The fitted model stores one coefficient per feature plus an intercept, O(m); during training, the sample matrix itself occupies O(n * m) space.
Which Approach is Fastest?
Linear regression is the fastest and simplest; decision trees and random forests are slower to train but can capture non-linear patterns that a straight-line fit misses.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Linear Regression | O(n*m) | O(m) | Simple, fast, linear relationships |
| Decision Tree | O(n*m*log n) | O(n) | Non-linear data, interpretability |
| Random Forest | O(t*n*m*log n) | O(t*n) | Higher accuracy on complex data, at extra cost |