How to Build Object Detection Models with PyTorch
To build an object detection model in PyTorch, load a pre-trained model such as Faster R-CNN from torchvision.models.detection, fine-tune it on your dataset, and train it on images with bounding box labels. The model outputs bounding boxes and class labels for the detected objects.

Syntax

Here is the basic syntax to load a pre-trained Faster R-CNN model, prepare it for training, and perform a forward pass:
- model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True): Loads a pre-trained Faster R-CNN.
- model.train(): Sets the model to training mode.
- outputs = model(images, targets): Runs images and targets through the model during training.
- outputs = model(images): Runs images through the model during evaluation to get predictions.
python
import torch
import torchvision

# Load pre-trained Faster R-CNN model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Set model to training mode
model.train()

# Example input: list of images (tensors) and targets (dicts with boxes and labels)
images = [torch.rand(3, 224, 224)]  # dummy image
targets = [{
    'boxes': torch.tensor([[10., 20., 100., 200.]], dtype=torch.float32),
    'labels': torch.tensor([1], dtype=torch.int64)
}]

# Forward pass during training
outputs = model(images, targets)

# Outputs is a dict with losses
print(outputs)
Output
{'loss_classifier': tensor(...), 'loss_box_reg': tensor(...), 'loss_objectness': tensor(...), 'loss_rpn_box_reg': tensor(...)}
Example
This example shows how to fine-tune a Faster R-CNN model on a small custom dataset with one image and one bounding box. It demonstrates loading the model, preparing data, training for one step, and getting predictions.
python
import torch
import torchvision

# Load pre-trained Faster R-CNN
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.train()

# Dummy image tensor (3 channels, 224x224)
image = torch.rand(3, 224, 224)
images = [image]

# Dummy target with one bounding box and label
targets = [{
    'boxes': torch.tensor([[50., 50., 150., 150.]], dtype=torch.float32),
    'labels': torch.tensor([1], dtype=torch.int64)
}]

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# Training step: sum the individual losses and backpropagate
optimizer.zero_grad()
losses = model(images, targets)
loss = sum(v for v in losses.values())
loss.backward()
optimizer.step()

print(f"Training loss: {loss.item():.4f}")

# Switch to eval mode for prediction
model.eval()
with torch.no_grad():
    prediction = model(images)

print("Prediction output:", prediction)
Output
Training loss: 1.2345
Prediction output: [{'boxes': tensor([[...]]), 'labels': tensor([...]), 'scores': tensor([...])}]
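In eval mode the model returns every candidate detection with a confidence score, so in practice you usually keep only detections above a score threshold. Here is a minimal sketch of that filtering step, run on a hand-written mock prediction dict shaped like the model's output (the values are made up for illustration; real ones come from model([image])):

```python
import torch

# Mock prediction dict shaped like torchvision's eval-mode output
# (values are illustrative, not from a real model run)
prediction = {
    'boxes': torch.tensor([[10., 20., 100., 200.],
                           [30., 40., 60., 80.]]),
    'labels': torch.tensor([1, 2]),
    'scores': torch.tensor([0.95, 0.30]),
}

# Keep only detections above a confidence threshold
threshold = 0.5
keep = prediction['scores'] > threshold
filtered = {k: v[keep] for k, v in prediction.items()}

print(filtered['boxes'])   # only the high-confidence detection remains
print(filtered['labels'])
```

The boolean mask indexes all three tensors consistently, so boxes, labels, and scores stay aligned after filtering.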
Common Pitfalls
- Incorrect input format: The model expects a list of image tensors and a list of target dictionaries with keys boxes and labels. Passing single tensors or wrong keys causes errors.
- Bounding box format: Boxes must be floats in [x_min, y_min, x_max, y_max] format.
- Model mode: Use model.train() for training and model.eval() for inference.
- Device mismatch: Make sure images, targets, and model are on the same device (CPU or GPU).
python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Wrong: passing a single tensor instead of a list
image = torch.rand(3, 224, 224)
try:
    output = model(image)  # This will raise an error
except Exception as e:
    print(f"Error: {e}")

# Right: pass a list of images
output = model([image])
print("Output keys:", output[0].keys())
Output
Error: Expected list of tensors as input
Output keys: dict_keys(['boxes', 'labels', 'scores'])
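The device-mismatch pitfall is worth a concrete sketch: every image tensor and every tensor inside the target dicts must be moved to the same device as the model. A minimal pattern (the model line is shown as a comment since loading it is the same as in the examples above):

```python
import torch

# Pick GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Images and all tensors inside the target dicts go to the same device
images = [torch.rand(3, 224, 224).to(device)]
targets = [{
    'boxes': torch.tensor([[10., 20., 100., 200.]], device=device),
    'labels': torch.tensor([1], dtype=torch.int64, device=device),
}]

# The model is moved the same way, e.g.:
# model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# model.to(device)

print(images[0].device, targets[0]['boxes'].device)
```

A common mistake is moving the model and images but forgetting the tensors nested inside the target dicts; a dict comprehension like {k: v.to(device) for k, v in t.items()} handles those.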
Quick Reference
Key points for building object detection with PyTorch:
- Use torchvision.models.detection for pre-built models like Faster R-CNN.
- Pass input images as a list of tensors and targets as a list of dicts with boxes and labels.
- Set the model mode correctly: train() or eval().
- Use an optimizer and backpropagation for training.
- Outputs during training are loss dicts; during eval are prediction dicts.
Key Takeaways
- Use torchvision's pre-trained Faster R-CNN model for easy object detection setup.
- Always pass images as a list of tensors and targets as a list of dicts with boxes and labels.
- Switch between model.train() for training and model.eval() for inference.
- Ensure bounding boxes are float tensors in [x_min, y_min, x_max, y_max] format.
- Use an optimizer and backpropagation to fine-tune the model on your dataset.