Distributed training basics in MLOps - Time & Space Complexity
When training machine learning models across many machines, it is important to understand how the training time changes as we add more data or machines.
We want to know how the total work grows when we split tasks in distributed training.
Analyze the time complexity of the following distributed training loop.
for epoch in range(num_epochs):
    for batch in data_batches:
        distribute_batch_to_workers(batch)
        workers_train_on_batch()
        gather_results_from_workers()
        update_model_parameters()
This code splits data into batches, sends each batch to workers, trains in parallel, then collects results to update the model.
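To make the loop concrete, here is a minimal single-process sketch of the same pattern. It is an illustration under assumptions, not the tutorial's actual implementation: the helper names `split_among_workers`, `worker_gradient`, and `train` are hypothetical stand-ins for the calls above, the "workers" are plain function calls rather than separate machines, and the model is a one-parameter linear fit.

```python
# Toy simulation of the distributed loop: no real workers or network involved.
# Model: y = w * x, trained with mean-squared error and data-parallel SGD.

def split_among_workers(batch, num_workers):
    """Stand-in for distribute_batch_to_workers: round-robin shards."""
    return [batch[i::num_workers] for i in range(num_workers)]

def worker_gradient(w, shard):
    """Stand-in for workers_train_on_batch: summed gradient of (w*x - y)^2."""
    return sum(2.0 * (w * x - y) * x for x, y in shard)

def train(data_batches, num_epochs, num_workers, lr=0.01):
    w = 0.0
    steps = 0
    for epoch in range(num_epochs):
        for batch in data_batches:
            shards = split_among_workers(batch, num_workers)
            # In a real system these calls would run in parallel on workers.
            grads = [worker_gradient(w, shard) for shard in shards]
            # Stand-in for gather_results_from_workers: average over the batch.
            grad = sum(grads) / len(batch)
            # Stand-in for update_model_parameters.
            w -= lr * grad
            steps += 1
    return w, steps

data = [(float(x), 3.0 * x) for x in range(1, 9)]          # true w is 3
batches = [data[i:i + 4] for i in range(0, len(data), 4)]  # 2 batches of 4
w, steps = train(batches, num_epochs=50, num_workers=2)
print(w, steps)
```

The step counter increments once per batch per epoch, which is the quantity the analysis below focuses on.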
Look for loops or repeated steps that take most time.
- Primary operation: Training on each batch by workers.
- How many times: Once per batch, repeated for all batches in all epochs.
As the number of batches grows, the total training time grows roughly in proportion.
| Input Size (n = batches) | Approx. Operations |
|---|---|
| 10 | 10 training steps per epoch |
| 100 | 100 training steps per epoch |
| 1000 | 1000 training steps per epoch |
Pattern observation: Doubling batches roughly doubles the training steps, so time grows linearly with data size.
Time Complexity: O(n)
This means training time grows linearly with the number of data batches processed, assuming the number of workers and the cost per batch stay fixed.
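The linear growth can be checked with a small counting sketch. The function name `training_steps` is hypothetical and it only counts loop iterations. It does not measure real time. With the number of epochs treated as a constant, the count is proportional to the number of batches.

```python
def training_steps(num_batches, num_epochs=1):
    """Count how many times the inner loop body runs."""
    steps = 0
    for _ in range(num_epochs):
        for _ in range(num_batches):
            steps += 1  # one distribute / train / gather / update cycle
    return steps

for n in (10, 100, 1000):
    print(n, training_steps(n))
```

Doubling `num_batches` doubles the returned count, which matches the table above.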
[X] Wrong: "Adding more machines always makes training time go down proportionally."
[OK] Correct: Communication and coordination between machines add overhead, so time does not always shrink perfectly with more workers.
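A rough cost model illustrates why speedup falls short of the ideal. Everything in this sketch is an assumption made for illustration: compute time is taken to divide evenly across workers, and communication cost is taken to grow linearly with the number of workers. Real systems depend on the collective algorithm, network, and model size, so the constants and the shape of the curve will differ.

```python
def step_time(num_workers, compute_time=1.0, comm_time_per_worker=0.02):
    """Assumed per-step time: parallel compute plus coordination overhead."""
    compute = compute_time / num_workers
    communication = comm_time_per_worker * (num_workers - 1)
    return compute + communication

def speedup(num_workers, **kwargs):
    """Speedup relative to a single worker under the same assumed model."""
    return step_time(1, **kwargs) / step_time(num_workers, **kwargs)

for workers in (1, 2, 4, 8, 16, 32):
    print(workers, round(speedup(workers), 2))
```

Under these assumed constants the speedup stays below the worker count and eventually decreases as workers are added, because the communication term starts to dominate.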
Understanding how training time scales with data and machines helps you explain real-world trade-offs in distributed machine learning.
"What if we increased the number of workers instead of batches? How would the time complexity change?"
Practice
Solution
Step 1: Understand distributed training goal
Distributed training shares the training task among several machines or GPUs to speed up the process.
Step 2: Analyze options
Only "To split the training workload across multiple machines or GPUs" correctly describes this purpose. The other options do not relate to workload distribution.
Final Answer:
To split the training workload across multiple machines or GPUs -> Option B
Quick Check:
Distributed training = workload split [OK]
- Thinking distributed training reduces dataset size
- Confusing distributed training with hyperparameter tuning
- Believing distributed training avoids GPU use
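One common way to split the workload is to give each worker its own slice of the dataset. The sketch below shows a strided split. The function name `shard_indices` is hypothetical, and this is only one possible sharding scheme, not a description of any particular library's behavior.

```python
def shard_indices(num_samples, rank, world_size):
    """Return the sample indices handled by the worker with this rank."""
    return list(range(rank, num_samples, world_size))

# Four workers sharing ten samples.
shards = [shard_indices(10, rank, 4) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

Every sample lands in exactly one shard, so the dataset itself is not reduced. Only the share each worker processes gets smaller.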
Solution
Step 1: Identify correct function name
The correct function to initialize communication is torch.distributed.init_process_group.
Step 2: Check syntax correctness
torch.distributed.init_process_group(backend='nccl', rank=0, world_size=1) uses the correct module and function name with proper parameters. Other options use incorrect function names or modules.
Final Answer:
torch.distributed.init_process_group(backend='nccl', rank=0, world_size=1) -> Option A
Quick Check:
Correct init function = torch.distributed.init_process_group(backend='nccl', rank=0, world_size=1) [OK]
- Using wrong function names like start_process_group
- Calling init_process_group from wrong module
- Misspelling function or module names
What is the output of print(rank, world_size) in the following code?
import torch.distributed as dist
rank = 2
world_size = 4
print(rank, world_size)
Solution
Step 1: Analyze variable assignments
Variables rank and world_size are assigned values 2 and 4 respectively before the print statement.
Step 2: Understand print output
Printing rank and world_size will output '2 4' exactly as assigned.
Final Answer:
2 4 -> Option D
Quick Check:
Print rank, world_size = 2 4 [OK]
- Confusing rank with world_size order
- Assuming variables are undefined
- Expecting automatic values without assignment
import torch.distributed as dist
dist.init_process_group(backend='nccl', rank=0)
What is missing that causes the error?
Solution
Step 1: Check init_process_group parameters
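When rank is passed explicitly, init_process_group also needs world_size so that each process knows the total number of processes, unless that value is supplied another way, for example through the WORLD_SIZE environment variable that launchers such as torchrun set.
Step 2: Identify missing parameter
The code passes rank but omits world_size, which causes the error.
Final Answer:
The world_size parameter is missing -> Option C
Quick Check: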
Missing world_size causes error [OK]
- Omitting world_size parameter
- Using wrong backend names
- Passing rank as string instead of int
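The mistakes above can be caught before initialization with a small validation step. The sketch below assumes the process identity arrives through the `RANK` and `WORLD_SIZE` environment variables, as launchers such as torchrun provide. The helper `read_process_identity` is hypothetical and is not part of PyTorch.

```python
import os

def read_process_identity(env=None):
    """Read rank and world_size as integers and check they are consistent."""
    env = os.environ if env is None else env
    missing = [name for name in ("RANK", "WORLD_SIZE") if name not in env]
    if missing:
        raise KeyError(f"missing environment variables: {missing}")
    # Environment values are strings, so convert them to int explicitly.
    rank = int(env["RANK"])
    world_size = int(env["WORLD_SIZE"])
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    return rank, world_size

# Example with an explicit mapping instead of the real environment.
rank, world_size = read_process_identity({"RANK": "2", "WORLD_SIZE": "4"})
print(rank, world_size)
```

The returned integers could then be passed as the `rank` and `world_size` arguments when initializing the process group.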
Solution
Step 1: Understand correct initialization order
dist.init_process_group must be called before dist.get_rank() or dist.get_world_size(), because those calls read from the initialized process group.
Step 2: Analyze each option
The correct option is:
import torch.distributed as dist
dist.init_process_group(backend='nccl')
rank = dist.get_rank()
world_size = dist.get_world_size()
print(rank, world_size)
It initializes the process group first, then gets the rank and world size, then prints them. Because rank and world_size are not passed here, they must be supplied externally, for example through environment variables set by the launcher. Other options either get the rank before initialization or pass the rank manually without initialization.
Final Answer:
The snippet above (init_process_group, then get_rank and get_world_size, then print) -> Option A
Quick Check:
Init first, then get rank/world_size [OK]
- Calling get_rank before init_process_group
- Passing rank manually without init
- Not calling init_process_group at all
