MLOps Program · Beginner · 2 min read

Python ML Program to Detect Fraud Using sklearn

Use sklearn's LogisticRegression to train a model on labeled transaction data: create the model with model = LogisticRegression(), fit it with model.fit(X_train, y_train), and predict fraud with model.predict(X_test).
📋

Examples

Input: [[100, 1], [200, 0], [150, 1]] (features), [0, 1, 0] (labels)
Output: [0, 1, 0] (predicted fraud labels)

Input: [[500, 0], [20, 1], [300, 0]] (features), [1, 0, 1] (labels)
Output: [1, 0, 1] (predicted fraud labels)

Input: [[0, 0], [0, 0], [0, 0]] (features), [0, 0, 0] (labels)
Output: [0, 0, 0] (predicted fraud labels)
🧠

How to Think About It

To detect fraud, first collect transaction data with features such as amount and type, plus labels marking whether each transaction is fraud. Then split the data into training and testing sets. Train a simple model such as Logistic Regression on the training data so it learns the patterns that separate fraud from legitimate transactions. Finally, run the model on the test data to predict fraud on transactions it has never seen.
📐

Algorithm

1. Collect and prepare transaction data with features and fraud labels
2. Split the data into training and testing sets
3. Create a Logistic Regression model
4. Train the model on the training data
5. Use the trained model to predict fraud on the test data
6. Evaluate prediction accuracy
💻

Code

sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data: [transaction amount, transaction type (0 or 1)]
X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]  # 0 = not fraud, 1 = fraud

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create and train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Predictions: {y_pred.tolist()}")
Output
Accuracy: 1.00
Predictions: [0, 1]
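Beyond hard 0/1 labels, LogisticRegression can also report how confident it is via predict_proba, which is often more useful for fraud screening because you can tune the flagging threshold. A minimal sketch using the same sample data (the 0.5 threshold is an illustrative choice, not a recommendation):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of class 1 (fraud)
fraud_prob = model.predict_proba(X_test)[:, 1]

# Flag transactions whose fraud probability exceeds a chosen threshold
flagged = (fraud_prob > 0.5).astype(int)
print(flagged.tolist())
```

Lowering the threshold catches more fraud at the cost of more false alarms; raising it does the opposite.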
🔍

Dry Run

Let's trace the sample data through the code to see how the model learns and predicts fraud.

1. Split data: training data [[150,1],[300,0],[50,1],[400,0]] with labels [0,1,0,1]; test data [[100,1],[200,0]] with labels [0,1]

2. Train model: the model learns the patterns between features and fraud labels from the training data.

3. Predict on test data: the model predicts labels for [[100,1],[200,0]] as [0,1].

4. Calculate accuracy: the predicted [0,1] matches the actual [0,1], so accuracy is 100%.

| Step | Data | Labels | Action | Result |
|------|------|--------|--------|--------|
| 1 | Training: [[150,1],[300,0],[50,1],[400,0]] | [0,1,0,1] | Train model | Model learns patterns |
| 2 | Test: [[100,1],[200,0]] | [0,1] | Predict | Predicted [0,1] |
| 3 | Compare predictions | [0,1] | Calculate accuracy | Accuracy = 1.0 |
💡

Why This Works

Step 1: Data preparation

We prepare features and labels so the model knows what to learn and what to predict.

Step 2: Model training

The Logistic Regression model finds patterns in the training data to separate fraud from non-fraud.

Step 3: Prediction and evaluation

The model predicts fraud on new data and we check accuracy to see how well it learned.
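One caveat on evaluation: real fraud data is heavily imbalanced (fraud is rare), so accuracy alone can look great even for a model that never flags anything. Precision and recall give a clearer picture. A small sketch with illustrative, made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative imbalanced example: only 2 of 10 transactions are fraud
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Precision: of the transactions flagged as fraud, how many really were
# Recall: of the actual frauds, how many the model caught
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 1 TP, 1 FP -> 0.50
print(f"Recall: {recall_score(y_true, y_pred):.2f}")        # 1 of 2 frauds -> 0.50
```

Here accuracy would be 80%, yet the model misses half the fraud, which precision and recall make visible.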

🔄

Alternative Approaches

Random Forest Classifier
sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Predictions: {y_pred.tolist()}")
Random Forest can capture more complex patterns but is slower and less interpretable than Logistic Regression.
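One partial remedy for Random Forest's lower interpretability is its feature_importances_ attribute, which scores how much each feature contributed to the trees' splits. A quick sketch on the sample data (fitting on all of it just to inspect importances; the feature names are our own labels):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Importance scores sum to 1; higher means the feature mattered more to the trees
for name, score in zip(["amount", "type"], model.feature_importances_):
    print(f"{name}: {score:.2f}")
```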
Support Vector Machine (SVM)
sklearn
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Predictions: {y_pred.tolist()}")
SVM works well for clear margin separation but can be slower on large datasets.
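SVMs are also sensitive to feature scale: in the sample data the transaction amount (50 to 400) dwarfs the 0/1 type flag, so scaling the features first usually helps. A sketch using a scikit-learn pipeline to standardize before the SVM:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

# Scale each feature to zero mean and unit variance before the SVM,
# so the large amount values don't dominate the 0/1 type flag
model = make_pipeline(StandardScaler(), SVC(kernel='linear'))
model.fit(X, y)

# The pipeline applies the same scaling automatically at prediction time
print(model.predict([[250, 0]]).tolist())
```

Wrapping the scaler in a pipeline also prevents a subtle leak: the scaler is fit only on the data passed to fit, never on the test data.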

Complexity: O(n * d) time, O(n * d) space

Time Complexity

Training Logistic Regression takes time proportional to the number of samples (n) times the number of features (d), per pass over the data. Predicting a single sample needs only one dot product over the d features, so it is much faster.

Space Complexity

The trained model stores one weight per feature, so its size is O(d). During training, the full data set of n samples with d features must also be held in memory, which is O(n * d).

Which Approach is Fastest?

Logistic Regression is faster and simpler than Random Forest and SVM, making it good for quick fraud detection on moderate data.

| Approach | Time | Space | Best For |
|----------|------|-------|----------|
| Logistic Regression | O(n*d) | O(d) | Fast, interpretable fraud detection |
| Random Forest | O(t*n*log n) | O(t*n) | Complex patterns, higher accuracy |
| SVM | O(n^2*d) | O(n*d) | Clear margin separation, smaller datasets |
💡
Always split your data into training and testing sets to fairly evaluate your fraud detection model.
⚠️
A common mistake is training and testing on the same data, which gives overly optimistic accuracy.
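One way to guard against that mistake, especially with as few samples as in this tutorial, is cross-validation: every sample is held out exactly once, so no model is ever scored on data it trained on. A sketch using cross_val_score on the same toy data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

# 3-fold cross-validation: the data is split into 3 folds, and each fold
# is held out once for testing while the model trains on the other two
scores = cross_val_score(LogisticRegression(), X, y, cv=3)
print(f"Mean accuracy: {scores.mean():.2f}")
```

The averaged score is a more honest estimate than a single train/test split, particularly on small data sets where one split can be lucky or unlucky.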