MLOps · Program · Beginner · 2 min read

Python ML Program to Predict Customer Churn with sklearn

Use sklearn to build a churn prediction model by loading data, splitting it with train_test_split, training a classifier like RandomForestClassifier, and evaluating accuracy with accuracy_score.
📋

Examples

Input: Customer data with features like age, tenure, and service usage
Output: Model accuracy: 0.85, Predictions: [0, 1, 0, 0, 1]

Input: New customer with features: age=30, tenure=12, usage=300
Output: Predicted churn: 0 (no churn)

Input: Customer data with missing values
Output: Missing values handled; model accuracy: 0.82
🧠

How to Think About It

To predict churn, first collect customer data with features that influence churn. Then, clean and prepare the data by handling missing values and encoding categories. Split the data into training and testing sets. Train a machine learning model like Random Forest on the training data. Finally, test the model on unseen data and measure accuracy to see how well it predicts churn.
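As a sketch of the cleaning step described above, here is one common way to handle missing values and encode categories with sklearn and pandas. The tiny dataset, its column names, and the 'plan' category are assumptions for illustration only:

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical raw data: one missing age, one categorical column
raw = pd.DataFrame({
    'age': [25, None, 35],
    'plan': ['basic', 'premium', 'basic'],
    'churn': [0, 1, 0]
})

# Fill missing numeric values with the column mean
imputer = SimpleImputer(strategy='mean')
raw[['age']] = imputer.fit_transform(raw[['age']])

# One-hot encode the categorical 'plan' column
clean = pd.get_dummies(raw, columns=['plan'])

print(clean.columns.tolist())
```

After this, every column is numeric and there are no missing values, so the data is ready for train_test_split and model training.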
📐

Algorithm

1
Load customer churn dataset
2
Clean and preprocess data (handle missing values, encode categories)
3
Split data into training and testing sets
4
Train a Random Forest classifier on training data
5
Predict churn on test data
6
Calculate and print accuracy score
💻

Code

sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Sample data
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})

X = data[['age', 'tenure', 'usage']]
y = data['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Model accuracy: {accuracy:.2f}")
print(f"Predictions: {predictions.tolist()}")
Output
Model accuracy: 1.00
Predictions: [0, 1]
🔍

Dry Run

Let's trace the sample data through the churn prediction code

1

Load data

Data has 5 customers with features age, tenure, usage, and churn labels.

2

Split data

Training set: 3 samples, Test set: 2 samples selected randomly.

3

Train model

RandomForestClassifier learns patterns from training features and churn labels.

4

Predict churn

Model predicts churn for test samples.

5

Calculate accuracy

Compare predicted labels with actual test labels to get accuracy 1.00.

Step | Action             | Data/Result
1    | Load data          | [{'age': 25, 'tenure': 12, 'usage': 200, 'churn': 0}, ...]
2    | Split data         | Train: 3 samples, Test: 2 samples
3    | Train model        | Model trained on 3 samples
4    | Predict churn      | Predictions: [0, 1]
5    | Calculate accuracy | Accuracy: 1.00
💡

Why This Works

Step 1: Data Preparation

We select features like age, tenure, and usage that influence churn and separate the target churn column.

Step 2: Train-Test Split

Splitting data with train_test_split ensures the model learns from one part and is tested on unseen data for fair evaluation.

Step 3: Model Training

Random Forest learns patterns by building many decision trees and combining their results to improve prediction accuracy.

Step 4: Evaluation

Accuracy score compares predicted churn labels to actual labels to measure model performance.
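Once trained, the same model can score a brand-new customer, as in the second example above. A minimal sketch, reusing the sample data from the Code section (the new customer's values are the hypothetical ones from the Examples section):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same sample data as the main example
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})
model = RandomForestClassifier(random_state=42)
model.fit(data[['age', 'tenure', 'usage']], data['churn'])

# Score a hypothetical new customer with the fitted model
new_customer = pd.DataFrame({'age': [30], 'tenure': [12], 'usage': [300]})
prediction = model.predict(new_customer)[0]
print(f"Predicted churn: {prediction}")
```

Passing the new customer as a DataFrame with the same column names as the training features keeps sklearn's feature-name check happy.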

🔄

Alternative Approaches

Logistic Regression
sklearn
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Logistic Regression accuracy: {accuracy:.2f}")
Simpler and faster but may be less accurate on complex data.
Gradient Boosting Classifier
sklearn
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Gradient Boosting accuracy: {accuracy:.2f}")
Often more accurate but slower to train.

Complexity: O(n * m * t) time, O(n * m) space

Time Complexity

Training a Random Forest takes time roughly proportional to the number of samples n, the number of features m, and the number of trees t.

Space Complexity

Stores training data and trees, so space grows with data size n * m.

Which Approach is Fastest?

Logistic Regression trains fastest but may be less accurate; Gradient Boosting is slower but can improve accuracy.

Approach            | Time     | Space  | Best For
Random Forest       | O(n*m*t) | O(n*m) | Balanced accuracy and speed
Logistic Regression | O(n*m)   | O(n*m) | Simple, fast models
Gradient Boosting   | O(n*m*t) | O(n*m) | High accuracy, slower training
💡
Always split your data into training and testing sets to fairly evaluate your churn prediction model.
⚠️
Not splitting data properly can cause the model to overfit and give misleadingly high accuracy.
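To see why this warning matters, compare accuracy measured on the training data itself with accuracy on a held-out split. A sketch using the same tiny sample data (on data this small, training accuracy is typically near-perfect, which is exactly the misleading signal to watch for):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same sample data as the main example
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})
X = data[['age', 'tenure', 'usage']]
y = data['churn']

# Wrong: evaluating on the same data the model was trained on
model = RandomForestClassifier(random_state=42)
model.fit(X, y)
train_acc = accuracy_score(y, model.predict(X))
print(f"Accuracy on training data: {train_acc:.2f}")

# Right: evaluating on a held-out test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy on held-out data: {test_acc:.2f}")
```

Only the second number reflects how the model might behave on customers it has never seen.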