GPU vs CPU inference tradeoffs in MLOps - Performance Comparison
When running machine learning models, choosing between GPU and CPU affects how fast predictions happen.
We want to understand how the time to get results changes as the input size grows on each device.
Analyze the time complexity of this inference code snippet.
    for batch in data_loader:
        inputs = batch.to(device)      # device is 'cpu' or 'cuda' (GPU)
        outputs = model(inputs)        # run inference
        results.append(outputs.cpu())  # copy outputs back to host memory
    # data_loader yields batches of size b
    # total data size is n
This code runs inference on batches of data either on CPU or GPU and collects results.
Identify the loops, recursion, or array traversals that repeat.
- Primary operation: Loop over batches to run model inference.
- How many times: Approximately n/b times, where n is total data size and b is batch size.
As input size n grows, the number of batches grows roughly proportionally, so total inference time grows too.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | ~10/b batches, short inference time |
| 100 | ~100/b batches, moderate inference time |
| 1000 | ~1000/b batches, longer inference time |
Pattern observation: Total time grows roughly linearly with input size n.
Time Complexity: O(n)
This means that, for a fixed model and batch size b, inference time grows in direct proportion to how much data you process.
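As a rough illustration of the batch count behind this O(n) result, the sketch below counts loop iterations for a few values of n. The helper name `num_batches` and the batch size of 32 are assumptions made for the example, and the last batch is assumed to be partially filled rather than dropped.

```python
import math

def num_batches(n: int, b: int) -> int:
    """Number of loop iterations needed to cover n items in batches of size b."""
    if n < 0 or b <= 0:
        raise ValueError("n must be non-negative and b must be positive")
    # The final batch may be only partially filled, hence the ceiling.
    return math.ceil(n / b)

# Hypothetical batch size, chosen only for illustration.
b = 32
for n in (10, 100, 1000):
    print(n, num_batches(n, b))
```

With b held fixed, the iteration count rises in step with n, which is the linear growth described above.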
[X] Wrong: "GPU inference always runs in constant time regardless of input size."
[OK] Correct: GPU speeds up parallel work but still processes all data, so time grows with input size.
Understanding how inference time scales helps you explain tradeoffs in real projects and shows you grasp performance basics.
"What if we increase batch size b significantly? How would the time complexity change or stay the same?"
Practice
Solution
Step 1: Understand GPU design for parallelism
GPUs have many cores designed to handle many operations at once, making them faster for large data batches.
Step 2: Compare CPU and GPU strengths
CPUs are better for single-threaded or small tasks, but GPUs excel at parallel processing, speeding up large inference jobs.
Final Answer:
Faster processing for large batches of data -> Option D
Quick Check:
GPU parallelism = Faster large batch inference [OK]
- Thinking GPUs always use less power
- Assuming CPUs are cheaper for large-scale inference
- Confusing single-threaded speed with parallel speed
Solution
Step 1: Understand CUDA_VISIBLE_DEVICES usage
Setting CUDA_VISIBLE_DEVICES to an empty string disables GPU visibility, forcing CPU usage.
Step 2: Check each option's effect
CUDA_VISIBLE_DEVICES='' python inference.py disables GPUs correctly; others either select GPUs or use invalid values.
Final Answer:
CUDA_VISIBLE_DEVICES='' python inference.py -> Option B
Quick Check:
Empty CUDA_VISIBLE_DEVICES disables GPU [OK]
- Using 0, which selects GPU 0 instead of disabling GPUs
- Using -1 is invalid for CUDA_VISIBLE_DEVICES
- Assuming 'all' disables GPUs (it is intended to enable all of them)
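The same idea can be sketched from Python by launching the inference script as a child process with the variable set to an empty string. The helper name `run_with_hidden_gpus` is hypothetical, and the child command here is only a stand-in for `python inference.py` that reports the value it receives.

```python
import os
import subprocess
import sys

def run_with_hidden_gpus(args: list) -> subprocess.CompletedProcess:
    """Run a command with CUDA_VISIBLE_DEVICES set to an empty string."""
    env = dict(os.environ)             # copy, so the parent environment is unchanged
    env["CUDA_VISIBLE_DEVICES"] = ""   # empty string: no GPUs visible to the child
    return subprocess.run(args, env=env, capture_output=True, text=True, check=True)

# Stand-in for `python inference.py`: the child prints the variable it sees.
child = [
    sys.executable,
    "-c",
    "import os; print(repr(os.environ.get('CUDA_VISIBLE_DEVICES')))",
]
result = run_with_hidden_gpus(child)
print(result.stdout.strip())
```

Setting the variable on the child's environment, rather than inside an already running process, avoids the case where the GPU runtime has been initialized before the variable changes.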
    import time

    start = time.time()
    # Run model inference here
    end = time.time()
    print(round(end - start, 2))
If GPU inference takes 0.05 seconds and CPU inference takes 0.5 seconds, what will be printed when running on CPU?
Solution
Step 1: Understand timing code output
The code prints the elapsed time rounded to 2 decimals, so it shows seconds taken.
Step 2: Match CPU inference time to output
CPU inference takes 0.5 seconds, so the printed output is 0.5.
Final Answer:
0.5 -> Option C
Quick Check:
CPU time = 0.5 seconds printed [OK]
- Confusing milliseconds with seconds
- Choosing GPU time instead of CPU time
- Misreading rounding precision
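For measurements outside a quiz setting, a slightly more careful timing helper might look like the sketch below. The names `time_call` and `fake_inference` are hypothetical, and `fake_inference` is only a CPU-bound stand-in for a real model call. When timing real GPU work, asynchronous execution usually means the device must be synchronized before the clock is stopped; that step is omitted here.

```python
import time

def time_call(fn, *args, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over several runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()   # monotonic, high-resolution clock
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def fake_inference(n: int) -> int:
    """Hypothetical stand-in for model inference: work that scales with n."""
    return sum(i * i for i in range(n))

elapsed = time_call(fake_inference, 100_000)
print(round(elapsed, 4))
```

Taking the best of several runs reduces the influence of one-off noise such as cache warm-up or other processes on the machine.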
Solution
Step 1: Identify GPU performance factors
GPUs perform best with larger batch sizes to utilize many cores efficiently.
Step 2: Evaluate options for improving GPU speed
Increasing batch size improves GPU throughput; reducing batch size or disabling GPU lowers performance.
Final Answer:
Increase batch size to better use GPU parallelism -> Option A
Quick Check:
GPU speed improves with larger batches [OK]
- Thinking smaller batches speed up GPU
- Disabling GPU to fix GPU slowness
- Using single-thread CPU instead of GPU
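One way to see why larger batches help is a toy cost model in which every batch pays a fixed overhead plus a per-item cost. The function `total_time` and both cost constants are assumptions invented for this sketch, not measurements; the model also ignores memory limits and assumes the per-item cost does not change with batch size.

```python
def total_time(n: int, b: int, overhead: float, per_item: float) -> float:
    """Toy model: each batch pays a fixed overhead plus a per-item cost."""
    full, rest = divmod(n, b)
    batches = full + (1 if rest else 0)   # count a partially filled last batch
    return batches * overhead + n * per_item

# Hypothetical costs in seconds, chosen only to make the trend visible.
OVERHEAD = 0.010    # fixed cost per batch (launch, transfer setup)
PER_ITEM = 0.0001   # cost per input once the batch is running

for b in (1, 10, 100):
    print(b, round(total_time(1000, b, OVERHEAD, PER_ITEM), 3))
```

Under these assumptions a larger b shrinks the overhead term, yet doubling n still doubles the total, so the growth in n remains linear.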
Solution
Step 1: Analyze model size and input volume impact
Small models and low input volume do not benefit much from GPU parallelism, so GPU cost is less justified.
Step 2: Consider budget and batch size tradeoffs
CPU inference with small batches reduces cost and matches low volume needs without GPU overhead.
Final Answer:
Use CPU inference with small batch sizes -> Option A
Quick Check:
Small model + low volume + budget = CPU small batch [OK]
- Choosing GPU despite low volume and budget
- Using large batches on CPU causing delays
- Ignoring cost when selecting GPU
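The budget reasoning can be sketched as a small capacity calculation. Every number below (prices and per-instance capacities) is a hypothetical placeholder rather than a quote from any provider, and `hourly_cost` is an illustrative helper that assumes always-on instances billed by the hour.

```python
import math

def hourly_cost(requests_per_hour: float, capacity_per_hour: float,
                price_per_hour: float) -> float:
    """Hourly cost of enough always-on instances to serve the given volume."""
    # At least one instance must run, even when volume is far below capacity.
    instances = max(1, math.ceil(requests_per_hour / capacity_per_hour))
    return instances * price_per_hour

# Hypothetical instance profiles: (capacity in requests/hour, price in $/hour).
CPU = (1_000, 0.10)
GPU = (50_000, 1.00)

for volume in (500, 100_000):
    cpu_cost = hourly_cost(volume, *CPU)
    gpu_cost = hourly_cost(volume, *GPU)
    print(volume, round(cpu_cost, 2), round(gpu_cost, 2))
```

With these placeholder figures the CPU option is cheaper at low volume because a single inexpensive instance is enough, while the GPU option only pays off once volume is high enough to keep it busy.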
