MLOps Program · Beginner · 2 min read

Python Sklearn Program to Predict Diabetes Using ML

Use sklearn's load_diabetes dataset, binarize its continuous target, and train a LogisticRegression model with model.fit(X_train, y_train); then predict diabetes with model.predict(X_test).
📋

Examples

Input: Patient data: [0.038075906, 0.05068012, 0.061696206, 0.021872354, -0.0442235, -0.03482076, -0.04340085, -0.00259226, 0.01990749, -0.01764613]
Output: Predicted diabetes outcome: 0 (no diabetes)

Input: Patient data: [0.08529988, -0.04664164, 0.02087577, 0.03168012, 0.08529988, -0.04664164, 0.02087577, 0.03168012, 0.08529988, -0.04664164]
Output: Predicted diabetes outcome: 1 (diabetes)

Input: Patient data: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Output: Predicted diabetes outcome: 0 (no diabetes)
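Examples like these can be reproduced by training the model from the Code section below and passing a single patient row to model.predict. Note that a single sample must be reshaped to a 2-D array of shape (1, 10) first. A minimal sketch:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression

# Train on the full dataset with a binarized target (threshold 140)
data = load_diabetes()
y = (data.target > 140).astype(int)
model = LogisticRegression(max_iter=1000).fit(data.data, y)

# A single patient row must be reshaped to (1, 10) before predicting
patient = np.array([0.038075906, 0.05068012, 0.061696206, 0.021872354,
                    -0.0442235, -0.03482076, -0.04340085, -0.00259226,
                    0.01990749, -0.01764613]).reshape(1, -1)
prediction = model.predict(patient)
print('Predicted diabetes outcome:', prediction[0])
```

The exact 0/1 result for a given patient depends on the trained weights, so it may vary slightly across sklearn versions.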
🧠

How to Think About It

First, get diabetes data with features and labels. Then split data into training and testing sets. Train a logistic regression model on training data. Finally, use the model to predict diabetes on new patient data.
📐

Algorithm

1. Load diabetes dataset with features and target labels
2. Split dataset into training and testing parts
3. Create logistic regression model
4. Train model using training data
5. Predict diabetes on test data
6. Print prediction results
💻

Code

sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset
data = load_diabetes()
X = data.data
# Binarize target: diabetes or not (threshold 140)
y = (data.target > 140).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print('Predictions:', predictions)
Output
Predictions: an array of 89 labels, one per test sample, each 0 (no diabetes) or 1 (diabetes); the exact values depend on your sklearn version
🔍

Dry Run

Let's trace the data through the pipeline, from loading to prediction.

1. Load and prepare data: features X have shape (442, 10); target y is binarized with threshold 140
2. Split data: training set size 353, testing set size 89
3. Train logistic regression: model learns weights from training data
4. Predict on test data: model outputs an array of 0s and 1s for diabetes prediction

Test Sample Index | Predicted Label
0 | 0
1 | 0
2 | 0
3 | 0
4 | 0
💡

Why This Works

Step 1: Load and binarize data

We use load_diabetes() to get the features, then convert the continuous target (a disease-progression score) into binary labels with target > 140, so that 1 marks diabetes presence.
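The binarization step is just a vectorized comparison. A tiny sketch on a few illustrative score values (not taken from the dataset):

```python
import numpy as np

# Continuous disease-progression scores (illustrative values)
target = np.array([97.0, 141.0, 206.0, 135.0, 178.0])

# Scores above the 140 threshold become class 1, the rest class 0
labels = (target > 140).astype(int)
print(labels)  # [0 1 1 0 1]
```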

Step 2: Train logistic regression

Logistic regression fits a model to separate diabetes vs no diabetes by learning weights for each feature.
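After fitting, the learned weights can be inspected directly; a sketch using the dataset's own feature names:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression

data = load_diabetes()
y = (data.target > 140).astype(int)
model = LogisticRegression(max_iter=1000).fit(data.data, y)

# One weight per feature; the sign shows each feature's direction of influence
print(model.coef_.shape)  # (1, 10)
for name, weight in zip(data.feature_names, model.coef_[0]):
    print(f'{name}: {weight:+.3f}')
```

The exact weight values vary with the sklearn version, but the shape is always one row of 10 coefficients for this binary, 10-feature problem.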

Step 3: Predict diabetes

The trained model predicts 0 or 1 for new data, indicating absence or presence of diabetes.
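Beyond hard 0/1 labels, predict_proba returns the model's confidence for each class, which can be useful when a simple yes/no is not enough. A sketch on the first three test samples:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_diabetes()
y = (data.target > 140).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Each row holds [P(no diabetes), P(diabetes)] and sums to 1
proba = model.predict_proba(X_test[:3])
print(proba.round(3))
```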

🔄

Alternative Approaches

Random Forest Classifier
sklearn
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load and binarize target
data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print('Predictions:', predictions)
Random Forest can capture complex patterns better but is slower and less interpretable than logistic regression.
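One partial remedy for the interpretability gap is that Random Forest reports per-feature importances; a sketch, printed from most to least important:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier

data = load_diabetes()
y = (data.target > 140).astype(int)

model = RandomForestClassifier(random_state=42).fit(data.data, y)

# Importances are non-negative and sum to 1 across the 10 features
ranked = sorted(zip(data.feature_names, model.feature_importances_),
                key=lambda pair: -pair[1])
for name, importance in ranked:
    print(f'{name}: {importance:.3f}')
```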
Support Vector Machine (SVM)
sklearn
from sklearn.datasets import load_diabetes
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Load and binarize target
data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print('Predictions:', predictions)
SVM with linear kernel works well for binary classification but can be slower on large datasets.

Complexity: O(n * d * i) time, O(n * d) space

Time Complexity

Training logistic regression takes O(n * d * i) where n is samples, d is features, and i is iterations for convergence.

Space Complexity

Storing data and model weights requires O(n * d) space; model weights are O(d).

Which Approach is Fastest?

Logistic regression is fastest and simplest; Random Forest and SVM are slower but may improve accuracy.

Approach | Time | Space | Best For
Logistic Regression | O(n * d * i) | O(n * d) | Fast, interpretable binary classification
Random Forest | O(trees * n * d * log n) | O(n * d) | Complex patterns, better accuracy
SVM | O(n^2 * d) | O(n * d) | High-dimensional data, margin-based classification
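To compare the approaches on this dataset yourself, you can score each one on the same held-out test set; a sketch (exact accuracies will vary slightly with your sklearn version):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_diabetes()
y = (data.target > 140).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, y, test_size=0.2, random_state=42)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM (linear)': SVC(kernel='linear'),
}

# Fit each model on the same training split and score it on the same test split
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f'{name}: {scores[name]:.3f}')
```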
💡
Always split your data into training and testing sets to check how well your model predicts new data.
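A quick way to apply this check is to compare accuracy on the training split against accuracy on the test split; a large gap suggests the model has overfit. A sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_diabetes()
y = (data.target > 140).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    data.data, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Train accuracy measures fit; test accuracy measures generalization
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
print(f'train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}')
```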
⚠️
Beginners often forget to convert the continuous diabetes target into binary labels before classification.