How to Build Object Detection Models with PyTorch
To build an object detection model in PyTorch, load a pre-trained model such as Faster R-CNN from torchvision.models.detection, fine-tune it on your dataset, and train it on images with bounding box labels. The model outputs bounding boxes and class labels for the detected objects.

Syntax

Here is the basic syntax to load a pre-trained Faster R-CNN model, prepare it for training, and perform a forward pass:
- model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True): Loads a pre-trained Faster R-CNN.
- model.train(): Sets the model to training mode.
- outputs = model(images, targets): Runs images and targets through the model during training.
- outputs = model(images): Runs images through the model during evaluation to get predictions.
python
import torch
import torchvision

# Load pre-trained Faster R-CNN model
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)

# Set model to training mode
model.train()

# Example input: list of images (tensors) and targets (dicts with boxes and labels)
images = [torch.rand(3, 224, 224)]  # dummy image
targets = [{
    'boxes': torch.tensor([[10., 20., 100., 200.]], dtype=torch.float32),
    'labels': torch.tensor([1], dtype=torch.int64)
}]

# Forward pass during training
outputs = model(images, targets)

# Outputs is a dict with losses
print(outputs)
Output
{'loss_classifier': tensor(...), 'loss_box_reg': tensor(...), 'loss_objectness': tensor(...), 'loss_rpn_box_reg': tensor(...)}
Example
This example shows how to fine-tune a Faster R-CNN model on a small custom dataset with one image and one bounding box. It demonstrates loading the model, preparing data, training for one step, and getting predictions.
python
import torch
import torchvision

# Load pre-trained Faster R-CNN
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.train()

# Dummy image tensor (3 channels, 224x224)
image = torch.rand(3, 224, 224)
images = [image]

# Dummy target with one bounding box and label
targets = [{
    'boxes': torch.tensor([[50., 50., 150., 150.]], dtype=torch.float32),
    'labels': torch.tensor([1], dtype=torch.int64)
}]

# Optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

# Training step: sum the individual losses and backpropagate
optimizer.zero_grad()
losses = model(images, targets)
loss = sum(v for v in losses.values())
loss.backward()
optimizer.step()

print(f"Training loss: {loss.item():.4f}")

# Switch to eval mode for prediction
model.eval()
with torch.no_grad():
    prediction = model(images)

print("Prediction output:", prediction)
Output
Training loss: 1.2345
Prediction output: [{'boxes': tensor([[...]]), 'labels': tensor([...]), 'scores': tensor([...])}]
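In eval mode the model returns every candidate detection with a confidence score, so in practice you usually keep only detections above a score threshold. Here is a minimal sketch of that filtering step, run on a hand-written mock prediction dict shaped like the model's output (the values are made up for illustration; real ones come from model([image])):

```python
import torch

# Mock prediction dict shaped like torchvision's eval-mode output
# (values are illustrative, not from a real model run)
prediction = {
    'boxes': torch.tensor([[10., 20., 100., 200.],
                           [30., 40., 60., 80.]]),
    'labels': torch.tensor([1, 2]),
    'scores': torch.tensor([0.95, 0.30]),
}

# Keep only detections above a confidence threshold
threshold = 0.5
keep = prediction['scores'] > threshold
filtered = {k: v[keep] for k, v in prediction.items()}

print(filtered['boxes'])   # only the high-confidence detection remains
print(filtered['labels'])
```

The boolean mask indexes all three tensors consistently, so boxes, labels, and scores stay aligned after filtering.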
Common Pitfalls
- Incorrect input format: The model expects a list of image tensors and a list of target dictionaries with keys boxes and labels. Passing single tensors or wrong keys causes errors.
- Bounding box format: Boxes must be floats in [x_min, y_min, x_max, y_max] format.
- Model mode: Use model.train() for training and model.eval() for inference.
- Device mismatch: Make sure images, targets, and model are on the same device (CPU or GPU).
python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# Wrong: passing a single tensor instead of a list
image = torch.rand(3, 224, 224)
try:
    output = model(image)  # This will raise an error
except Exception as e:
    print(f"Error: {e}")

# Right: pass a list of images
output = model([image])
print("Output keys:", output[0].keys())
Output
Error: Expected list of tensors as input
Output keys: dict_keys(['boxes', 'labels', 'scores'])
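The device-mismatch pitfall is worth a concrete sketch: every image tensor and every tensor inside the target dicts must be moved to the same device as the model. A minimal pattern (the model line is shown as a comment since loading it is the same as in the examples above):

```python
import torch

# Pick GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Images and all tensors inside the target dicts go to the same device
images = [torch.rand(3, 224, 224).to(device)]
targets = [{
    'boxes': torch.tensor([[10., 20., 100., 200.]], device=device),
    'labels': torch.tensor([1], dtype=torch.int64, device=device),
}]

# The model is moved the same way, e.g.:
# model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
# model.to(device)

print(images[0].device, targets[0]['boxes'].device)
```

A common mistake is moving the model and images but forgetting the tensors nested inside the target dicts; a dict comprehension like {k: v.to(device) for k, v in t.items()} handles those.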
Quick Reference
Key points for building object detection with PyTorch:
- Use torchvision.models.detection for pre-built models like Faster R-CNN.
- Pass input images as a list of tensors and targets as a list of dicts with boxes and labels.
- Set the model mode correctly: train() or eval().
- Use an optimizer and backpropagation for training.
- Outputs during training are loss dicts; during eval are prediction dicts.
Key Takeaways
- Use torchvision's pre-trained Faster R-CNN model for easy object detection setup.
- Always pass images as a list of tensors and targets as a list of dicts with boxes and labels.
- Switch between model.train() for training and model.eval() for inference.
- Ensure bounding boxes are float tensors in [x_min, y_min, x_max, y_max] format.
- Use an optimizer and backpropagation to fine-tune the model on your dataset.