Python Sklearn Program to Predict Diabetes Using ML
This program uses the load_diabetes dataset and LogisticRegression: it trains a model with model.fit(X_train, y_train) and then predicts diabetes with model.predict(X_test).
Code
```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset
data = load_diabetes()
X = data.data

# Binarize target: diabetes or not (threshold 140)
y = (data.target > 140).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print('Predictions:', predictions)
```
Dry Run
Let's trace sample patient data through the model prediction.

1. Load and prepare data: features X have shape (442, 10); target y is binarized with threshold 140.
2. Split data: training set size 353, testing set size 89.
3. Train logistic regression: the model learns one weight per feature from the training data.
4. Predict on test data: the model outputs an array of 0s and 1s as the diabetes prediction.
| Test Sample Index | Predicted Label |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
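The dataset and split sizes from the dry run can be verified with a quick sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)

# Full dataset: 442 samples, 10 features
print('X shape:', X.shape)

# An 80/20 split of 442 rows gives 353 training and 89 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print('Train:', X_train.shape[0], 'Test:', X_test.shape[0])
```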
Why This Works
Step 1: Load and binarize data
We use load_diabetes() to get the features and convert the continuous target into binary labels with target > 140, marking the presence of diabetes.
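A minimal sketch of the binarization step on a few hand-picked example values (the 140 threshold is this tutorial's choice, not a clinical standard):

```python
import numpy as np

# Continuous disease-progression scores, like load_diabetes().target
target = np.array([97.0, 151.0, 140.0, 206.0])

# Values strictly above 140 become 1 (diabetes), the rest 0
y = (target > 140).astype(int)
print(y)  # -> [0 1 0 1]
```

Note that a value of exactly 140 maps to 0, since the comparison is strict.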
Step 2: Train logistic regression
Logistic regression fits a model to separate diabetes vs no diabetes by learning weights for each feature.
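Because logistic regression learns one weight per feature, the fitted weights can be inspected directly; a sketch using the model's coef_ attribute:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression

data = load_diabetes()
y = (data.target > 140).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(data.data, y)

# One learned weight per feature (10 features in this dataset);
# the sign hints at the direction of each feature's influence
for name, w in zip(data.feature_names, model.coef_[0]):
    print(f'{name}: {w:+.3f}')
```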
Step 3: Predict diabetes
The trained model predicts 0 or 1 for new data, indicating absence or presence of diabetes.
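Besides the hard 0/1 labels, predict_proba exposes the model's estimated probability for each class, which is often more useful in a medical setting; a sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Columns: P(no diabetes), P(diabetes); each row sums to 1
proba = model.predict_proba(X_test[:3])
print(proba)
```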
Alternative Approaches
Using a Random Forest classifier:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load and binarize target
data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('Predictions:', predictions)
```
Using a support vector machine with a linear kernel:

```python
from sklearn.datasets import load_diabetes
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Load and binarize target
data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('Predictions:', predictions)
```
Complexity: O(n * d * i) time, O(n * d) space
Time Complexity
Training logistic regression takes O(n * d * i) time, where n is the number of samples, d the number of features, and i the number of iterations needed for convergence.
Space Complexity
Storing the data requires O(n * d) space; the model weights themselves take only O(d).
Which Approach is Fastest?
Logistic regression is the fastest and simplest; Random Forest and SVM are slower to train but can capture more complex patterns and may improve accuracy.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Logistic Regression | O(n * d * i) | O(n * d) | Fast, interpretable binary classification |
| Random Forest | O(trees * n * d * log n) | O(n * d) | Complex patterns, better accuracy |
| SVM | O(n^2 * d) | O(n * d) | High-dimensional data, margin-based classification |
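To compare the three approaches on the same split, accuracy_score can be applied to each; a sketch (exact scores depend on the split and library version, so none are hard-coded here):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit each candidate model on the same training data
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM (linear)': SVC(kernel='linear'),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f'{name}: {results[name]:.3f}')
```

A single train/test split is a noisy estimate; cross-validation would give a more reliable comparison.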