Python ML Program to Detect Fraud Using sklearn
Create a model with model = LogisticRegression(), fit it with model.fit(X_train, y_train), and flag fraud with model.predict(X_test).
How to Think About It
Fraud detection here is a binary classification problem: each transaction is labeled fraudulent (1) or legitimate (0), and a classifier learns a decision boundary from labeled examples so it can label new transactions.
Algorithm
1. Prepare features (transaction amount, transaction type) and fraud labels.
2. Split the data into training and test sets.
3. Train a Logistic Regression model on the training set.
4. Predict fraud labels for the test set.
5. Compare predictions with the true labels to compute accuracy.
Code
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data: [transaction amount, transaction type (0 or 1)]
X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]  # 0 = not fraud, 1 = fraud

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Create and train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Predictions: {y_pred.tolist()}")
```
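On real transaction data the raw dollar amounts can dwarf the 0/1 type feature, which can slow or skew the solver. A minimal sketch, assuming the same toy data, that standardizes features with a StandardScaler inside a Pipeline before fitting:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same toy dataset as above
X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# Scaling keeps both features on a comparable range for the solver
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)
print(model.predict(X_test).tolist())
```

The pipeline applies the scaler's training-set statistics to the test set automatically, avoiding data leakage.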
Dry Run
Let's trace the sample data through the code to see how the model learns and predicts fraud.
1. Split data — training data: [[150,1],[300,0],[50,1],[400,0]] with labels [0,1,0,1]; test data: [[100,1],[200,0]] with labels [0,1].
2. Train model — the model learns the relationship between the features and the fraud labels in the training data.
3. Predict on test data — the model predicts labels [0,1] for [[100,1],[200,0]].
4. Calculate accuracy — predicted [0,1] matches actual [0,1], so accuracy is 1.00 (100%).
| Step | Data | Labels | Action | Result |
|---|---|---|---|---|
| 1 | Training: [[150,1],[300,0],[50,1],[400,0]] | [0,1,0,1] | Train model | Model learns patterns |
| 2 | Test: [[100,1],[200,0]] | [0,1] | Predict | Predicted [0,1] |
| 3 | Predictions: [0,1] | [0,1] | Compare with actual labels | Accuracy = 1.0 |
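The split sizes in the dry run can be checked directly (the exact partition depends on random_state=42, but the sizes follow from test_size=0.33):

```python
from sklearn.model_selection import train_test_split

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# With 6 samples and test_size=0.33, two samples land in the test set
print(len(X_train), len(X_test))  # 4 2
```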
Why This Works
Step 1: Data preparation
We prepare features and labels so the model knows what to learn and what to predict.
Step 2: Model training
The Logistic Regression model finds patterns in the training data to separate fraud from non-fraud.
Step 3: Prediction and evaluation
The model predicts fraud on new data and we check accuracy to see how well it learned.
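In real fraud detection, fraud is rare, so accuracy alone can look high even when every fraud case is missed. A hedged sketch, using hypothetical prediction values, of richer evaluation with scikit-learn's confusion_matrix and classification_report:

```python
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical labels standing in for real model output
y_test = [0, 1, 0, 1]
y_pred = [0, 1, 0, 0]  # one fraud case missed

# Rows = actual class, columns = predicted class
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, zero_division=0))
```

The confusion matrix here is [[2, 0], [1, 1]]: two legitimate transactions correctly cleared, one fraud caught, one fraud missed, which precision/recall expose but a single accuracy number (0.75) hides.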
Alternative Approaches
Random Forest:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Predictions: {y_pred.tolist()}")
```
Support Vector Machine (SVM):

```python
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Predictions: {y_pred.tolist()}")
```
Complexity: O(n * d) time; O(n * d) space to hold the training data, while the fitted model itself stores only O(d) weights
Time Complexity
Training Logistic Regression takes time proportional to the number of samples (n) times the number of features (d) per pass over the data. Predicting a single sample is just a dot product, O(d).
Space Complexity
The fitted model stores one weight per feature plus an intercept, so the model itself is O(d); during training, holding the data adds O(n * d).
Which Approach is Fastest?
Logistic Regression is faster and simpler than Random Forest and SVM, making it good for quick fraud detection on moderate data.
| Approach | Time | Space | Best For |
|---|---|---|---|
| Logistic Regression | O(n*d) | O(d) | Fast, interpretable fraud detection |
| Random Forest | O(t*n*log n) | O(t*n) | Complex patterns, higher accuracy |
| SVM | O(n^2*d) | O(n*d) | Clear margin separation, smaller datasets |
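The relative training costs in the table can be sanity-checked empirically. A minimal sketch using time.perf_counter on the toy data; absolute timings vary by machine and are dominated by overhead at this tiny scale, so treat the numbers as illustrative only:

```python
import time
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

# Time the fit() call of each candidate model
times = {}
for model in (LogisticRegression(),
              RandomForestClassifier(random_state=42),
              SVC(kernel='linear')):
    start = time.perf_counter()
    model.fit(X, y)
    times[type(model).__name__] = time.perf_counter() - start

for name, t in times.items():
    print(f"{name}: {t:.4f}s")
```

On larger datasets the asymptotic differences in the table become visible; Random Forest pays for its t trees, and SVM's cost grows quickly with n.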