Distributed training basics in MLOps - Time & Space Complexity
When training machine learning models across many machines, it is important to understand how training time changes as we add more data or more machines.
In particular, we want to know how the total work grows when we split tasks across workers in distributed training.
Analyze the time complexity of the following distributed training loop.
```python
for epoch in range(num_epochs):
    for batch in data_batches:
        distribute_batch_to_workers(batch)
        workers_train_on_batch()
        gather_results_from_workers()
        update_model_parameters()
```
This code splits data into batches, sends each batch to workers, trains in parallel, then collects results to update the model.
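The loop above can be sketched as a runnable toy. This is a minimal illustration, not a real distributed system: thread-pool "workers" and a single scalar parameter stand in for actual machines and a model, and helper names like `train_on_shard` and `distribute_batch` are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-ins for the distributed primitives in the loop above.
# A real system would use a framework such as torch.distributed or Horovod;
# here each "worker" just computes the mean of its shard of the batch.

def train_on_shard(shard):
    # Placeholder "gradient": the mean of the shard's values.
    return sum(shard) / len(shard)

def distribute_batch(batch, num_workers):
    # Split the batch into roughly equal shards, one per worker.
    return [batch[i::num_workers] for i in range(num_workers)]

def training_loop(data_batches, num_epochs=2, num_workers=4):
    model_param = 0.0
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        for epoch in range(num_epochs):
            for batch in data_batches:
                # distribute_batch_to_workers
                shards = distribute_batch(batch, num_workers)
                # workers_train_on_batch + gather_results_from_workers
                grads = list(pool.map(train_on_shard, shards))
                # update_model_parameters (toy gradient step)
                model_param -= 0.1 * sum(grads) / len(grads)
    return model_param

batches = [[float(i + j) for j in range(8)] for i in range(3)]
print(training_loop(batches))
```

Even in this toy, the structure is the same: every batch triggers one distribute/train/gather/update cycle, which is the operation we count below.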
Look for loops or repeated steps that take the most time.
- Primary operation: Training on each batch by workers.
- How many times: Once per batch, repeated for all batches in all epochs.
As the number of batches grows, the total training time grows roughly in proportion.
| Input Size (n = batches) | Approx. Operations |
|---|---|
| 10 | 10 training steps per epoch |
| 100 | 100 training steps per epoch |
| 1000 | 1000 training steps per epoch |
Pattern observation: Doubling batches roughly doubles the training steps, so time grows linearly with data size.
Time Complexity: O(n)
This means training time grows linearly with the number of data batches processed.
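The table above can be checked directly by counting operations. This small sketch counts one training step per batch per epoch, mirroring the nested loops:

```python
def training_steps(num_batches, num_epochs=1):
    # One training step per batch, repeated for every epoch.
    steps = 0
    for _ in range(num_epochs):
        for _ in range(num_batches):
            steps += 1
    return steps

# Reproduces the table: 10 -> 10, 100 -> 100, 1000 -> 1000 steps per epoch.
for n in (10, 100, 1000):
    print(n, training_steps(n))
```

Doubling `num_batches` exactly doubles the step count, which is the hallmark of O(n) growth.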
[X] Wrong: "Adding more machines always makes training time go down proportionally."
[OK] Correct: Communication and coordination between machines add overhead, so training time does not shrink in exact proportion to the number of workers.
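The overhead effect can be illustrated with a simple cost model. The numbers here are made up for illustration (not from a real benchmark): compute is assumed to parallelize perfectly, while coordination cost grows with the number of workers.

```python
def epoch_time(num_workers, compute_time=100.0, comm_cost_per_worker=2.0):
    # Idealized model: compute splits evenly across workers,
    # but communication overhead grows with every worker added.
    return compute_time / num_workers + comm_cost_per_worker * num_workers

# More workers helps at first, then the overhead dominates.
for w in (1, 2, 4, 8, 16):
    print(w, round(epoch_time(w), 2))
```

In this toy model the epoch time falls from 102 at one worker to 28.5 at eight, then rises again at sixteen: past some point, adding workers makes the epoch slower, which is exactly the trade-off the correction above describes.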
Understanding how training time scales with data and machines helps you explain real-world trade-offs in distributed machine learning.
"What if we increased the number of workers instead of batches? How would the time complexity change?"