Python Sklearn Program to Predict Diabetes Using ML
This program uses the load_diabetes dataset and LogisticRegression: it trains a model with model.fit(X_train, y_train) and then predicts diabetes with model.predict(X_test).
Code
```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load dataset
data = load_diabetes()
X = data.data

# Binarize target: diabetes or not (threshold 140)
y = (data.target > 140).astype(int)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print('Predictions:', predictions)
```
Dry Run
Let's trace sample patient data through the model prediction.

1. Load and prepare data: features X have shape (442, 10); target y is binarized with threshold 140.
2. Split data: training set size 353, testing set size 89.
3. Train logistic regression: the model learns one weight per feature from the training data.
4. Predict on test data: the model outputs an array of 0s and 1s as the diabetes prediction.
| Test Sample Index | Predicted Label |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 0 |
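The dataset and split sizes from the dry run can be verified with a quick sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)

# Full dataset: 442 samples, 10 features
print('X shape:', X.shape)

# An 80/20 split of 442 rows gives 353 training and 89 test samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print('Train:', X_train.shape[0], 'Test:', X_test.shape[0])
```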
Why This Works
Step 1: Load and binarize data
We use load_diabetes() to get the features and convert the continuous target into binary labels with target > 140, marking the presence of diabetes.
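A minimal sketch of the binarization step on a few hand-picked example values (the 140 threshold is this tutorial's choice, not a clinical standard):

```python
import numpy as np

# Continuous disease-progression scores, like load_diabetes().target
target = np.array([97.0, 151.0, 140.0, 206.0])

# Values strictly above 140 become 1 (diabetes), the rest 0
y = (target > 140).astype(int)
print(y)  # -> [0 1 0 1]
```

Note that a value of exactly 140 maps to 0, since the comparison is strict.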
Step 2: Train logistic regression
Logistic regression fits a model to separate diabetes vs no diabetes by learning weights for each feature.
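Because logistic regression learns one weight per feature, the fitted weights can be inspected directly; a sketch using the model's coef_ attribute:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression

data = load_diabetes()
y = (data.target > 140).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(data.data, y)

# One learned weight per feature (10 features in this dataset);
# the sign hints at the direction of each feature's influence
for name, w in zip(data.feature_names, model.coef_[0]):
    print(f'{name}: {w:+.3f}')
```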
Step 3: Predict diabetes
The trained model predicts 0 or 1 for new data, indicating absence or presence of diabetes.
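Besides the hard 0/1 labels, predict_proba exposes the model's estimated probability for each class, which is often more useful in a medical setting; a sketch:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Columns: P(no diabetes), P(diabetes); each row sums to 1
proba = model.predict_proba(X_test[:3])
print(proba)
```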
Alternative Approaches
Using a Random Forest classifier:

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load and binarize target
data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('Predictions:', predictions)
```
Using a support vector machine with a linear kernel:

```python
from sklearn.datasets import load_diabetes
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Load and binarize target
data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print('Predictions:', predictions)
```
Complexity: O(n * d * i) time, O(n * d) space
Time Complexity
Training logistic regression takes O(n * d * i) time, where n is the number of samples, d the number of features, and i the number of iterations needed for convergence.
Space Complexity
Storing the data requires O(n * d) space; the model weights themselves take only O(d).
Which Approach is Fastest?
Logistic regression is the fastest and simplest; Random Forest and SVM are slower to train but can capture more complex patterns and may improve accuracy.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Logistic Regression | O(n * d * i) | O(n * d) | Fast, interpretable binary classification |
| Random Forest | O(trees * n * d * log n) | O(n * d) | Complex patterns, better accuracy |
| SVM | O(n^2 * d) | O(n * d) | High-dimensional data, margin-based classification |
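To compare the three approaches on the same split, accuracy_score can be applied to each; a sketch (exact scores depend on the split and library version, so none are hard-coded here):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

data = load_diabetes()
X = data.data
y = (data.target > 140).astype(int)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit each candidate model on the same training data
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM (linear)': SVC(kernel='linear'),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print(f'{name}: {results[name]:.3f}')
```

A single train/test split is a noisy estimate; cross-validation would give a more reliable comparison.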