MLOps / DevOps · ~10 mins

Distributed training basics in MLOps - Commands & Configuration

Introduction
Training machine learning models on large datasets can take a long time on a single computer. Distributed training splits the work across multiple machines or processors so training finishes faster and scales to bigger data. It is worth reaching for when:
When your dataset is too large to fit into one machine's memory.
When training a deep learning model takes hours or days on a single GPU.
When you want to speed up model training by using multiple GPUs or machines.
When you need to scale your training to handle more complex models.
When you want to improve resource usage by distributing workload efficiently.
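The core idea behind the scenarios above can be sketched without any framework: each worker computes a gradient on its own shard of the data, the workers average their gradients (an "all-reduce"), and every worker applies the same update. A minimal pure-Python sketch with a toy model y = w * x (the numbers and helper names here are illustrative, not PyTorch APIs):

```python
# Data-parallel training in miniature: each "worker" holds a shard of the
# data, computes a local gradient, and all workers average (all-reduce)
# their gradients before taking the same update step.
def local_gradient(w, shard):
    # Gradient of mean squared error for the toy model y = w * x.
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # Stand-in for what NCCL/Gloo do across real processes.
    return sum(grads) / len(grads)

def train_step(w, shards, lr=0.01):
    grads = [local_gradient(w, shard) for shard in shards]  # parallel in reality
    g = all_reduce_mean(grads)
    return w - lr * g

# Two workers, each with its own shard of (x, y) pairs where y = 3 * x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = train_step(w, shards)
print(round(w, 2))  # → 3.0
```

In a real job the per-worker gradient computation runs in separate processes on separate GPUs, and the all-reduce is done by the communication backend; only the averaging idea carries over unchanged.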
Commands
This command starts distributed training using PyTorch's built-in launcher with 2 processes on one machine. Each process handles one GPU to train the model in parallel.
Terminal
python -m torch.distributed.launch --nproc_per_node=2 train.py
Expected Output
INFO: Distributed training initialized with 2 processes
Epoch 1/10: loss=0.45
Epoch 2/10: loss=0.38
... (training logs continue)
--nproc_per_node - Number of processes to launch on the current node, usually set to the number of GPUs.
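Note that torch.distributed.launch is deprecated in recent PyTorch releases in favor of torchrun (e.g. `torchrun --nproc_per_node=2 train.py`). Either launcher works the same way under the hood: it exports per-process environment variables (RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, MASTER_PORT) that the training script reads. A small sketch of how a script might pick them up, with single-process defaults so it also runs without a launcher:

```python
import os

def dist_env():
    # The launcher (torchrun / torch.distributed.launch) exports these
    # variables into each worker process; the defaults cover a plain
    # single-process run for local debugging.
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
        "master_addr": os.environ.get("MASTER_ADDR", "127.0.0.1"),
        "master_port": int(os.environ.get("MASTER_PORT", 29500)),
    }

env = dist_env()
print(env["rank"], env["world_size"])
```

Under torchrun with `--nproc_per_node=2`, one process would print `0 2` and the other `1 2`.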
This command runs one process of a distributed training job using the NCCL backend for GPU communication. It specifies the total number of processes (world_size), this process's rank, and the master node's address and port.
Terminal
python train.py --backend nccl --world_size 4 --rank 0 --master_addr 127.0.0.1 --master_port 29500
Expected Output
INFO: Process 0 initialized for distributed training
Epoch 1/10: loss=0.46
Epoch 2/10: loss=0.39
... (training logs continue)
--backend - Communication backend for distributed training, NCCL is optimized for GPUs.
--world_size - Total number of processes participating in training.
--rank - Unique ID of this process among all processes.
--master_addr - IP address of the master node coordinating training.
--master_port - Port on the master node for communication.
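The flags above are arguments of the training script itself, not of Python or PyTorch. A script like this hypothetical train.py might parse them along these lines before handing them to torch.distributed.init_process_group (the init call itself is omitted so the sketch runs without a GPU or a second process):

```python
import argparse

def parse_dist_args(argv=None):
    # Flags mirror the command shown above; a real script would pass
    # these values to torch.distributed.init_process_group.
    p = argparse.ArgumentParser()
    p.add_argument("--backend", default="nccl", choices=["nccl", "gloo"])
    p.add_argument("--world_size", type=int, default=1)
    p.add_argument("--rank", type=int, default=0)
    p.add_argument("--master_addr", default="127.0.0.1")
    p.add_argument("--master_port", type=int, default=29500)
    return p.parse_args(argv)

args = parse_dist_args(
    ["--backend", "nccl", "--world_size", "4", "--rank", "0"]
)
print(args.backend, args.world_size, args.rank)  # → nccl 4 0
```

Each of the four processes in the example command would be started with the same world_size but a different rank (0 through 3).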
Key Concept

If you remember nothing else from distributed training, remember this: splitting training across multiple processes or machines lets you train faster and handle bigger data by dividing the work while keeping the model updates in sync.

Code Example
MLOps
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import os

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def setup():
    # Reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT from the
    # environment set by the launcher, then pins this process to its GPU.
    dist.init_process_group(backend='nccl', init_method='env://')
    torch.cuda.set_device(int(os.environ['LOCAL_RANK']))

def cleanup():
    dist.destroy_process_group()

class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(10, 1)

    def forward(self, x):
        return self.linear(x)

def train():
    setup()
    model = SimpleModel().cuda()  # lands on the GPU chosen in setup()
    # DDP averages gradients across all processes on every backward pass,
    # so each process ends every step with identical weights.
    ddp_model = DDP(model, device_ids=[int(os.environ['LOCAL_RANK'])])
    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    for epoch in range(2):
        # Toy random batch; real code would shard a dataset per rank,
        # typically with torch.utils.data.DistributedSampler.
        inputs = torch.randn(20, 10).cuda()
        labels = torch.randn(20, 1).cuda()
        optimizer.zero_grad()
        outputs = ddp_model(inputs)
        loss = loss_fn(outputs, labels)
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch+1}: loss={loss.item():.4f}")

    cleanup()

if __name__ == '__main__':
    train()
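One detail the example above glosses over: each process should train on a different slice of the data, which PyTorch usually handles with torch.utils.data.DistributedSampler. Conceptually, the sampler shards sample indices by rank, roughly like this pure-Python sketch:

```python
def shard_indices(num_samples, rank, world_size):
    # Round-robin assignment of sample indices to ranks: the same idea
    # DistributedSampler uses so that workers see disjoint data.
    return list(range(rank, num_samples, world_size))

# 10 samples split across 2 workers: disjoint shards covering every sample.
print(shard_indices(10, 0, 2))  # → [0, 2, 4, 6, 8]
print(shard_indices(10, 1, 2))  # → [1, 3, 5, 7, 9]
```

The real DistributedSampler also shuffles per epoch and pads so every rank gets the same number of samples, but the rank-based split is the core of it.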
Common Mistakes
Mistake: Not setting the correct world_size or rank for each process.
Why it hurts: Processes won't coordinate properly, causing errors or hanging training.
Fix: Ensure each process has a unique rank and the total world_size matches the number of processes.
Mistake: Using the wrong backend for the hardware (e.g., using NCCL on CPUs).
Why it hurts: Communication will fail or be very slow because the backend is not compatible.
Fix: Use NCCL for GPUs and Gloo for CPUs.
Mistake: Not launching the right number of processes matching available GPUs.
Why it hurts: Some GPUs remain idle or processes compete for the same GPU, reducing efficiency.
Fix: Set nproc_per_node to the number of GPUs on the machine.
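All three mistakes can be caught before training starts with a small sanity check at startup. A hypothetical helper (not a PyTorch API; you would supply the GPU count from torch.cuda.device_count() in a real script):

```python
def check_dist_config(rank, world_size, backend, num_gpus):
    # Guard against the common misconfigurations listed above; returns a
    # list of human-readable problems, empty if the config looks sane.
    errors = []
    if not 0 <= rank < world_size:
        errors.append(f"rank {rank} is not in [0, {world_size})")
    if backend not in ("nccl", "gloo"):
        errors.append(f"unknown backend {backend!r}")
    if backend == "nccl" and num_gpus == 0:
        errors.append("NCCL needs GPUs; use gloo on CPU-only machines")
    return errors

print(check_dist_config(rank=0, world_size=2, backend="gloo", num_gpus=0))  # → []
```

Calling this on every rank at startup and aborting on a non-empty result turns a silent hang into an immediate, readable error.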
Summary
Use torchrun (or the older, deprecated torch.distributed.launch) or explicit environment variables to start multiple training processes for distributed training.
Set backend, world_size, rank, master address, and port correctly to enable communication between processes.
DistributedDataParallel wraps your model to synchronize gradients and speed up training across GPUs or machines.