MLOps · Program · Beginner · 2 min read

Python ML Program to Predict Customer Churn with sklearn

Use sklearn to build a churn prediction model by loading data, splitting it with train_test_split, training a classifier like RandomForestClassifier, and evaluating accuracy with accuracy_score.
📋

Examples

Input: Customer data with features like age, tenure, and service usage
Output: Model accuracy: 0.85, Predictions: [0, 1, 0, 0, 1]

Input: New customer with features: age=30, tenure=12, usage=300
Output: Predicted churn: 0 (no churn)

Input: Customer data with missing values
Output: Missing values handled; model accuracy: 0.82
🧠

How to Think About It

To predict churn, first collect customer data with features that influence churn. Then, clean and prepare the data by handling missing values and encoding categories. Split the data into training and testing sets. Train a machine learning model like Random Forest on the training data. Finally, test the model on unseen data and measure accuracy to see how well it predicts churn.
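As a sketch of the cleaning step described above, here is one common way to handle missing values and encode categories with sklearn and pandas. The tiny dataset, its column names, and the 'plan' category are assumptions for illustration only:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw data: one missing age, one categorical column
raw = pd.DataFrame({
    'age': [25, None, 35],
    'plan': ['basic', 'premium', 'basic'],
    'churn': [0, 1, 0]
})

# Fill missing numeric values with the column mean
imputer = SimpleImputer(strategy='mean')
raw[['age']] = imputer.fit_transform(raw[['age']])

# One-hot encode the categorical 'plan' column
clean = pd.get_dummies(raw, columns=['plan'])

print(clean.columns.tolist())
```

After this, every column is numeric and there are no missing values, so the data is ready for train_test_split and model training.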
📐

Algorithm

1
Load customer churn dataset
2
Clean and preprocess data (handle missing values, encode categories)
3
Split data into training and testing sets
4
Train a Random Forest classifier on training data
5
Predict churn on test data
6
Calculate and print accuracy score
💻

Code

sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Sample data
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})

X = data[['age', 'tenure', 'usage']]
y = data['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Model accuracy: {accuracy:.2f}")
print(f"Predictions: {predictions.tolist()}")
Output
Model accuracy: 1.00
Predictions: [0, 1]
🔍

Dry Run

Let's trace the sample data through the churn prediction code

1

Load data

Data has 5 customers with features age, tenure, usage, and churn labels.

2

Split data

Training set: 3 samples, Test set: 2 samples selected randomly.

3

Train model

RandomForestClassifier learns patterns from training features and churn labels.

4

Predict churn

Model predicts churn for test samples.

5

Calculate accuracy

Compare predicted labels with actual test labels to get accuracy 1.00.

Step | Action             | Data/Result
1    | Load data          | [{'age': 25, 'tenure': 12, 'usage': 200, 'churn': 0}, ...]
2    | Split data         | Train: 3 samples, Test: 2 samples
3    | Train model        | Model trained on 3 samples
4    | Predict churn      | Predictions: [0, 1]
5    | Calculate accuracy | Accuracy: 1.00
💡

Why This Works

Step 1: Data Preparation

We select features like age, tenure, and usage that influence churn and separate the target churn column.

Step 2: Train-Test Split

Splitting data with train_test_split ensures the model learns from one part and is tested on unseen data for fair evaluation.

Step 3: Model Training

Random Forest learns patterns by building many decision trees and combining their results to improve prediction accuracy.

Step 4: Evaluation

Accuracy score compares predicted churn labels to actual labels to measure model performance.
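Once trained, the same model can score a brand-new customer, as in the second example above. A minimal sketch, reusing the sample data from the Code section (the new customer's values are the hypothetical ones from the Examples section):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same sample data as the main example
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})
model = RandomForestClassifier(random_state=42)
model.fit(data[['age', 'tenure', 'usage']], data['churn'])

# Score a hypothetical new customer with the fitted model
new_customer = pd.DataFrame({'age': [30], 'tenure': [12], 'usage': [300]})
prediction = model.predict(new_customer)[0]
print(f"Predicted churn: {prediction}")
```

Passing the new customer as a DataFrame with the same column names as the training features keeps sklearn's feature-name check happy.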

🔄

Alternative Approaches

Logistic Regression
sklearn
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Logistic Regression accuracy: {accuracy:.2f}")
Simpler and faster but may be less accurate on complex data.
Gradient Boosting Classifier
sklearn
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Gradient Boosting accuracy: {accuracy:.2f}")
Often more accurate but slower to train.

Complexity: O(n * m * t) time, O(n * m) space

Time Complexity

Training a Random Forest takes time roughly proportional to the number of samples n, the number of features m, and the number of trees t.

Space Complexity

Stores training data and trees, so space grows with data size n * m.

Which Approach is Fastest?

Logistic Regression trains fastest but may be less accurate; Gradient Boosting is slower but can improve accuracy.

Approach            | Time     | Space  | Best For
Random Forest       | O(n*m*t) | O(n*m) | Balanced accuracy and speed
Logistic Regression | O(n*m)   | O(n*m) | Simple, fast models
Gradient Boosting   | O(n*m*t) | O(n*m) | High accuracy, slower training
💡
Always split your data into training and testing sets to fairly evaluate your churn prediction model.
⚠️
Not splitting data properly can cause the model to overfit and give misleadingly high accuracy.
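To see why this warning matters, compare accuracy measured on the training data itself with accuracy on a held-out split. A sketch using the same tiny sample data (on data this small, training accuracy is typically near-perfect, which is exactly the misleading signal to watch for):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same sample data as the main example
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})
X = data[['age', 'tenure', 'usage']]
y = data['churn']

# Wrong: evaluating on the same data the model was trained on
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
train_acc = accuracy_score(y, model.predict(X))
print(f"Accuracy on training data: {train_acc:.2f}")

# Right: evaluating on a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy on held-out data: {test_acc:.2f}")
```

Only the second number reflects how the model might behave on customers it has never seen.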