Data Parallelism vs. Model Parallelism in MLOps: A Performance Comparison
When training machine learning models, we often split work to speed things up. This can be done by splitting data or splitting the model itself.
We want to understand how training time changes as the data or the model grows under each of these two approaches.
Analyze the time complexity of these simplified parallel training steps.
```
# Data parallelism: replicate the model, split each batch across workers
for each batch in data_batches:
    send a shard of the batch to each worker
    each worker computes its forward and backward pass
    gather (all-reduce) the gradients and update every model replica

# Model parallelism: split the model across devices
split model into parts, one part per device
for each input batch:
    pass activations through the model parts sequentially
    compute gradients and update each part
```
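The two loops above can be sketched as a toy cost model. The timings below (`T_COMPUTE`, `T_COMM`) are hypothetical unit costs chosen for illustration, not measurements from a real training run:

```python
# Toy per-step cost model (hypothetical unit costs).
# T_COMPUTE: time for one full forward/backward pass on one device.
# T_COMM: fixed cost of one communication/synchronization step.
T_COMPUTE = 1.0
T_COMM = 0.1

def data_parallel_step(num_workers: int) -> float:
    """Each worker processes 1/num_workers of the batch in parallel,
    then gradients are gathered once."""
    return T_COMPUTE / num_workers + T_COMM

def model_parallel_step(num_parts: int) -> float:
    """The batch still flows through every part in order, with a
    device-to-device transfer between consecutive parts."""
    return T_COMPUTE + (num_parts - 1) * T_COMM

print(data_parallel_step(4))   # compute shrinks as workers are added
print(model_parallel_step(4))  # transfer cost grows with parts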
This code shows two ways to split training: by data batches or by model parts.
Look at what repeats and costs time:
- Primary operation: Forward and backward passes over data or model parts.
- How many times: For data parallelism, once per data batch, with each batch split across workers; for model parallelism, once per model part per batch, executed sequentially.
As the data grows, data parallelism splits each batch across workers, so with w workers the wall-clock time per batch is roughly 1/w of the single-device time (plus gradient-synchronization overhead), while the total work still grows linearly with the number of batches.
| Input Size (n batches) | Approx. Operations |
|---|---|
| 10 | 10 forward/backward passes split across workers |
| 100 | 100 forward/backward passes split across workers |
| 1000 | 1000 forward/backward passes split across workers |
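The linear growth in the table can be checked with a tiny cost model. The per-batch compute and communication costs here are hypothetical, not measured from a real cluster:

```python
# Estimated wall-clock time for data parallelism over n batches
# (hypothetical per-batch costs, for illustration only).
def data_parallel_time(n_batches: int, num_workers: int,
                       t_compute: float = 1.0, t_comm: float = 0.1) -> float:
    """Each batch costs its sharded compute plus one gradient sync."""
    per_batch = t_compute / num_workers + t_comm
    return n_batches * per_batch

# For a fixed worker count, time grows linearly in the batch count:
for n in (10, 100, 1000):
    print(n, data_parallel_time(n, num_workers=4))
```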
For model parallelism, as the model grows, the number of sequential parts grows, so the time per batch increases roughly linearly with the number of parts.
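That per-batch latency can be modeled the same way. The per-part compute time and transfer cost below are assumed values for illustration:

```python
# Per-batch latency under model parallelism: the batch visits every
# part in order, with a transfer between consecutive parts
# (hypothetical unit costs).
def model_parallel_batch_time(num_parts: int,
                              t_per_part: float = 0.25,
                              t_transfer: float = 0.1) -> float:
    return num_parts * t_per_part + (num_parts - 1) * t_transfer

# Latency grows roughly linearly with the number of parts:
for p in (1, 2, 4, 8):
    print(p, model_parallel_batch_time(p))
```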
Time Complexity: O(n)
This means training time grows roughly in direct proportion to the number of data batches or model parts processed.
[X] Wrong: "Splitting data or model always makes training twice as fast when doubling workers or parts."
[OK] Correct: Communication overhead and sequential steps in model parallelism limit speed gains, so doubling resources does not always halve time.
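The saturation effect can be made concrete with a simple speedup model in the spirit of Amdahl's law: the fixed communication cost per step (a hypothetical value here) does not shrink as workers are added, so speedup plateaus below the worker count:

```python
# Speedup with a fixed communication cost per step: doubling workers
# halves the compute term but leaves synchronization untouched, so
# speedup saturates (hypothetical costs).
def speedup(num_workers: int, t_compute: float = 1.0,
            t_comm: float = 0.2) -> float:
    serial = t_compute + t_comm
    parallel = t_compute / num_workers + t_comm
    return serial / parallel

for w in (1, 2, 4, 8, 16):
    print(w, round(speedup(w), 2))
```

Note that even with unlimited workers, speedup in this model is capped at (t_compute + t_comm) / t_comm, which is 6x for the assumed costs.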
Understanding how splitting work affects training time helps you explain trade-offs in real projects. It shows you can think about scaling and efficiency clearly.
What if we combined data and model parallelism? How would the time complexity change?