Python ML Program to Predict Customer Churn with sklearn
This program shows how to split data with train_test_split, train a classifier such as RandomForestClassifier, and evaluate it with accuracy_score.
How to Think About It
Churn prediction is a binary classification problem: each customer is described by numeric features (age, tenure, usage) and labeled 1 if they churned or 0 if they stayed. A classifier learns the relationship between features and labels from known customers and applies it to unseen ones.
Algorithm
1. Load the customer data into a pandas DataFrame.
2. Separate the features (age, tenure, usage) from the target column (churn).
3. Split the data into training and test sets with train_test_split.
4. Fit a RandomForestClassifier on the training set.
5. Predict churn for the test set and measure accuracy with accuracy_score.
Code
```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Sample data
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})

X = data[['age', 'tenure', 'usage']]
y = data['churn']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Model accuracy: {accuracy:.2f}")
print(f"Predictions: {predictions.tolist()}")
```
Dry Run
Let's trace the sample data through the churn prediction code step by step.
Load data
Data has 5 customers with features age, tenure, usage, and churn labels.
Split data
With test_size=0.4, the 5 samples are split randomly into a training set of 3 samples and a test set of 2 samples.
Train model
RandomForestClassifier learns patterns from training features and churn labels.
Predict churn
Model predicts churn for test samples.
Calculate accuracy
Predicted labels are compared with the actual test labels, giving an accuracy of 1.00.
| Step | Action | Data/Result |
|---|---|---|
| 1 | Load data | [{'age':25,'tenure':12,'usage':200,'churn':0}, ...] |
| 2 | Split data | Train: 3 samples, Test: 2 samples |
| 3 | Train model | Model trained on 3 samples |
| 4 | Predict churn | Predictions: [0, 1] |
| 5 | Calculate accuracy | Accuracy: 1.00 |
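The split sizes in step 2 can be checked directly. This sketch repeats the split from the main program and prints the resulting sizes:

```python
from sklearn.model_selection import train_test_split
import pandas as pd

# Same sample data as in the main program
data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})
X = data[['age', 'tenure', 'usage']]
y = data['churn']

# test_size=0.4 of 5 samples -> 2 test rows, 3 training rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
print(len(X_train), len(X_test))  # 3 2
```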
Why This Works
Step 1: Data Preparation
We select features like age, tenure, and usage that influence churn and separate the target churn column.
Step 2: Train-Test Split
Splitting data with train_test_split ensures the model learns from one part and is tested on unseen data for fair evaluation.
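On small or imbalanced churn data, a plain random split can leave one class out of the test set entirely. A minimal variation of the same split, using train_test_split's optional stratify parameter to keep the churn ratio similar in both halves:

```python
from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})
X = data[['age', 'tenure', 'usage']]
y = data['churn']

# stratify=y preserves the churn/no-churn ratio in both splits,
# so each class appears in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
print(sorted(y_test.tolist()))
```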
Step 3: Model Training
Random Forest learns patterns by building many decision trees and combining their results to improve prediction accuracy.
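A useful by-product of the ensemble: a fitted RandomForestClassifier exposes feature_importances_, which hints at which features drive the churn prediction. A minimal sketch on the same sample data (the scores depend on the random seed and this tiny dataset, so treat them as illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})
X = data[['age', 'tenure', 'usage']]
y = data['churn']

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# feature_importances_ sums to 1.0 across all features
for name, score in zip(X.columns, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```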
Step 4: Evaluation
Accuracy score compares predicted churn labels to actual labels to measure model performance.
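Accuracy alone can be misleading when churners are rare. scikit-learn's confusion_matrix and classification_report give a fuller picture; a small sketch with hand-made labels (not the model's actual predictions):

```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = [0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 1]

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_true, y_pred))  # [[2, 1], [0, 2]]
print(classification_report(y_true, y_pred, zero_division=0))
```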
Alternative Approaches
```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Logistic Regression accuracy: {accuracy:.2f}")
```
```python
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Gradient Boosting accuracy: {accuracy:.2f}")
```
Complexity: O(n * m * t) time, O(n * m) space
Time Complexity
Training Random Forest takes time proportional to number of samples n, features m, and number of trees t.
Space Complexity
Stores training data and trees, so space grows with data size n * m.
Which Approach is Fastest?
Logistic Regression trains fastest but may be less accurate; Gradient Boosting is slower but can improve accuracy.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Random Forest | O(n*m*t) | O(n*m) | Balanced accuracy and speed |
| Logistic Regression | O(n*m) | O(n*m) | Simple, fast models |
| Gradient Boosting | O(n*m*t) | O(n*m) | High accuracy, slower training |
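A single 3/2 split is too small to rank these models reliably; cross-validation averages accuracy over several splits. A sketch comparing all three classifiers with cross_val_score (cv=2 here only because the smaller class in the toy data has just two members):

```python
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import pandas as pd

data = pd.DataFrame({
    'age': [25, 45, 35, 50, 23],
    'tenure': [12, 24, 36, 48, 6],
    'usage': [200, 150, 300, 400, 100],
    'churn': [0, 1, 0, 0, 1]
})
X = data[['age', 'tenure', 'usage']]
y = data['churn']

models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'Logistic Regression': LogisticRegression(random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    # Each model is trained and scored on 2 stratified folds
    scores = cross_val_score(model, X, y, cv=2)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```

On realistic churn datasets, a higher cv (e.g. 5) and more samples give a much more trustworthy comparison.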