MLOps Program · Beginner · 2 min read

Python ML Program to Detect Fraud Using sklearn

Use sklearn's LogisticRegression to train a model on labeled transaction data: create the model with model = LogisticRegression(), fit it with model.fit(X_train, y_train), and predict fraud with model.predict(X_test).
📋

Examples

Input: [[100, 1], [200, 0], [150, 1]] (features), [0, 1, 0] (labels)
Output: [0, 1, 0] (predicted fraud labels)

Input: [[500, 0], [20, 1], [300, 0]] (features), [1, 0, 1] (labels)
Output: [1, 0, 1] (predicted fraud labels)

Input: [[0, 0], [0, 0], [0, 0]] (features), [0, 0, 0] (labels)
Output: [0, 0, 0] (predicted fraud labels)
🧠

How to Think About It

To detect fraud, first collect transaction data with features such as amount and type, plus labels marking whether each transaction is fraud. Then split the data into training and testing sets. Train a simple model such as Logistic Regression on the training data so it learns the patterns that separate fraud from legitimate transactions. Finally, run the model on the test data to predict fraud on transactions it has never seen.
📐

Algorithm

1. Collect and prepare transaction data with features and fraud labels
2. Split the data into training and testing sets
3. Create a Logistic Regression model
4. Train the model on the training data
5. Use the trained model to predict fraud on the test data
6. Evaluate prediction accuracy
💻

Code

sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data: [transaction amount, transaction type (0 or 1)]
X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]  # 0 = not fraud, 1 = fraud

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Create and train model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Print accuracy
print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Predictions: {y_pred.tolist()}")
Output
Accuracy: 1.00
Predictions: [0, 1]
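Beyond hard 0/1 labels, LogisticRegression can also report how confident it is via predict_proba, which is often more useful for fraud screening because you can tune the flagging threshold. A minimal sketch using the same sample data (the 0.5 threshold is an illustrative choice, not a recommendation):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)

# Column 1 of predict_proba holds the probability of class 1 (fraud)
fraud_prob = model.predict_proba(X_test)[:, 1]

# Flag transactions whose fraud probability exceeds a chosen threshold
flagged = (fraud_prob > 0.5).astype(int)
print(flagged.tolist())
```

Lowering the threshold catches more fraud at the cost of more false alarms; raising it does the opposite.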
🔍

Dry Run

Let's trace the sample data through the code to see how the model learns and predicts fraud.

1. Split data: training data [[150,1],[300,0],[50,1],[400,0]] with labels [0,1,0,1]; test data [[100,1],[200,0]] with labels [0,1]

2. Train model: the model learns the patterns between features and fraud labels from the training data.

3. Predict on test data: the model predicts labels for [[100,1],[200,0]] as [0,1].

4. Calculate accuracy: the predicted [0,1] matches the actual [0,1], so accuracy is 100%.

| Step | Data | Labels | Action | Result |
|------|------|--------|--------|--------|
| 1 | Training: [[150,1],[300,0],[50,1],[400,0]] | [0,1,0,1] | Train model | Model learns patterns |
| 2 | Test: [[100,1],[200,0]] | [0,1] | Predict | Predicted [0,1] |
| 3 | Compare predictions | [0,1] | Calculate accuracy | Accuracy = 1.0 |
💡

Why This Works

Step 1: Data preparation

We prepare features and labels so the model knows what to learn and what to predict.

Step 2: Model training

The Logistic Regression model finds patterns in the training data to separate fraud from non-fraud.

Step 3: Prediction and evaluation

The model predicts fraud on new data and we check accuracy to see how well it learned.
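One caveat on evaluation: real fraud data is heavily imbalanced (fraud is rare), so accuracy alone can look great even for a model that never flags anything. Precision and recall give a clearer picture. A small sketch with illustrative, made-up labels:

```python
from sklearn.metrics import precision_score, recall_score

# Illustrative imbalanced example: only 2 of 10 transactions are fraud
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

# Precision: of the transactions flagged as fraud, how many really were
# Recall: of the actual frauds, how many the model caught
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 1 TP, 1 FP -> 0.50
print(f"Recall: {recall_score(y_true, y_pred):.2f}")        # 1 of 2 frauds -> 0.50
```

Here accuracy would be 80%, yet the model misses half the fraud, which precision and recall make visible.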

🔄

Alternative Approaches

Random Forest Classifier
sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Predictions: {y_pred.tolist()}")
Random Forest can capture more complex patterns but is slower and less interpretable than Logistic Regression.
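One partial remedy for Random Forest's lower interpretability is its feature_importances_ attribute, which scores how much each feature contributed to the trees' splits. A quick sketch on the sample data (fitting on all of it just to inspect importances; the feature names are our own labels):

```python
from sklearn.ensemble import RandomForestClassifier

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

model = RandomForestClassifier(random_state=42)
model.fit(X, y)

# Importance scores sum to 1; higher means the feature mattered more to the trees
for name, score in zip(["amount", "type"], model.feature_importances_):
    print(f"{name}: {score:.2f}")
```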
Support Vector Machine (SVM)
sklearn
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = SVC(kernel='linear')
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(f"Predictions: {y_pred.tolist()}")
SVM works well for clear margin separation but can be slower on large datasets.
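SVMs are also sensitive to feature scale: in the sample data the transaction amount (50 to 400) dwarfs the 0/1 type flag, so scaling the features first usually helps. A sketch using a scikit-learn pipeline to standardize before the SVM:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

# Scale each feature to zero mean and unit variance before the SVM,
# so the large amount values don't dominate the 0/1 type flag
model = make_pipeline(StandardScaler(), SVC(kernel='linear'))
model.fit(X, y)

# The pipeline applies the same scaling automatically at prediction time
print(model.predict([[250, 0]]).tolist())
```

Wrapping the scaler in a pipeline also prevents a subtle leak: the scaler is fit only on the data passed to fit, never on the test data.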

Complexity: O(n * d) time, O(n * d) space

Time Complexity

Training Logistic Regression takes time proportional to the number of samples (n) times the number of features (d), per pass over the data. Predicting a single sample needs only one dot product over the d features, so it is much faster.

Space Complexity

The trained model stores one weight per feature, so its size is O(d). During training, the full data set of n samples with d features must also be held in memory, which is O(n * d).

Which Approach is Fastest?

Logistic Regression is faster and simpler than Random Forest and SVM, making it good for quick fraud detection on moderate data.

| Approach | Time | Space | Best For |
|----------|------|-------|----------|
| Logistic Regression | O(n*d) | O(d) | Fast, interpretable fraud detection |
| Random Forest | O(t*n*log n) | O(t*n) | Complex patterns, higher accuracy |
| SVM | O(n^2*d) | O(n*d) | Clear margin separation, smaller datasets |
💡
Always split your data into training and testing sets to fairly evaluate your fraud detection model.
⚠️
A common mistake is training and testing on the same data, which gives overly optimistic accuracy.
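One way to guard against that mistake, especially with as few samples as in this tutorial, is cross-validation: every sample is held out exactly once, so no model is ever scored on data it trained on. A sketch using cross_val_score on the same toy data:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = [[100, 1], [200, 0], [150, 1], [300, 0], [50, 1], [400, 0]]
y = [0, 1, 0, 1, 0, 1]

# 3-fold cross-validation: the data is split into 3 folds, and each fold
# is held out once for testing while the model trains on the other two
scores = cross_val_score(LogisticRegression(), X, y, cv=3)
print(f"Mean accuracy: {scores.mean():.2f}")
```

The averaged score is a more honest estimate than a single train/test split, particularly on small data sets where one split can be lucky or unlucky.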