Kubernetes for ML workloads in MLOps - Time & Space Complexity
When running machine learning tasks on Kubernetes, it is important to understand how the time to complete jobs grows as the workload size increases.
We want to know how the system handles more data or more tasks and how that affects execution time.
Analyze the time complexity of the following Kubernetes job submission code for ML workloads.
for job in ml_jobs:
    kubectl apply -f job.yaml --record
    wait_for_job_completion(job)
This code submits multiple ML jobs to Kubernetes one after another and waits for each to finish before starting the next.
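The loop above is pseudocode. A minimal runnable sketch of the same sequential pattern in Python might look like the following. It assumes `kubectl` is on the PATH and that each manifest describes one Job. The function name `submit_sequentially`, the manifest file names, and the timeout value are hypothetical and chosen only for illustration. The demo at the bottom uses a fake runner, so nothing is sent to a cluster.

```python
import subprocess


def submit_sequentially(manifest_paths, run=subprocess.run, timeout="3600s"):
    """Apply each Job manifest, then block until that Job completes."""
    for path in manifest_paths:
        # Create (or update) the Job described by the manifest.
        run(["kubectl", "apply", "-f", path], check=True)
        # Block until the Job reports the "complete" condition or the timeout expires.
        run(
            ["kubectl", "wait", "--for=condition=complete", "-f", path,
             f"--timeout={timeout}"],
            check=True,
        )


# Dry run with a fake runner that only records the commands (hypothetical file names).
recorded = []


def fake_run(cmd, check=True):
    recorded.append(cmd)


submit_sequentially(["job-a.yaml", "job-b.yaml"], run=fake_run)
print(len(recorded))  # 4: one apply and one wait per manifest
```

Passing the runner as a parameter keeps the loop testable without a cluster; calling `submit_sequentially` with the default `run` would invoke the real `kubectl`.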
Look at what repeats in this code.
- Primary operation: Submitting and waiting for each ML job to complete.
- How many times: Once for each job in the list.
As the number of ML jobs increases, the total time grows roughly in direct proportion, assuming each job takes about the same time to run.
| Input Size (n) | Approx. Operations |
|---|---|
| 10 | 10 job submissions and waits |
| 100 | 100 job submissions and waits |
| 1000 | 1000 job submissions and waits |
Pattern observation: Doubling the number of jobs roughly doubles the total time because jobs run one after another.
Time Complexity: O(n)
This means the total time grows linearly with the number of ML jobs submitted.
[X] Wrong: "Submitting jobs one by one is always faster because it avoids overload."
[OK] Correct: Sequential submission waits for each job to finish before starting the next, so the total time is the sum of all job durations. When the cluster has spare capacity, running jobs in parallel shortens the total time.
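The difference between sequential and parallel submission can be sketched as a simple timing model. This is an illustration, not a measurement: the job durations are made-up numbers, `slots` is a hypothetical stand-in for how many jobs the cluster can run at once, and scheduling overhead is ignored.

```python
import heapq


def sequential_makespan(durations):
    """Wall-clock time when each job waits for the previous one to finish."""
    return sum(durations)


def parallel_makespan(durations, slots):
    """Wall-clock time when up to `slots` jobs run at once, started in list order."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    # Min-heap holding the time at which each slot next becomes free.
    free_at = [0.0] * slots
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available slot
        heapq.heappush(free_at, start + duration)
    return max(free_at)


durations = [10.0] * 8  # eight jobs of 10 minutes each (made-up numbers)
print(sequential_makespan(durations))    # 80.0
print(parallel_makespan(durations, 4))   # 20.0
print(parallel_makespan(durations, 8))   # 10.0
```

In this model the sequential time grows with n, while the parallel time grows with roughly n divided by the number of slots, so it is still linear in n once the cluster is saturated.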
Understanding how job submission scales helps you design better ML pipelines on Kubernetes and shows you can think about system efficiency clearly.
"What if we submitted all ML jobs at once without waiting? How would the time complexity change?"
Practice
Solution
Step 1: Understand Kubernetes resource types
Jobs are designed to run tasks that complete once, like ML training.
Step 2: Match resource to ML training task
Since training is a one-time batch task, Job is the correct resource.
Final Answer:
Job -> Option A
Quick Check:
One-time ML training = Job [OK]
- Choosing Deployment which is for long-running services
- Confusing Service with workload resource
- Using ConfigMap which stores config data only
Solution
Step 1: Identify GPU resource naming in Kubernetes
GPUs are requested using a vendor-specific resource name such as nvidia.com/gpu.
Step 2: Check correct YAML structure for limits
GPUs are specified under limits with the correct key. A request may be omitted, in which case it defaults to the limit; if it is given, it must equal the limit.
Final Answer:
resources: limits: nvidia.com/gpu: 2 -> Option C
Quick Check:
GPU request uses nvidia.com/gpu under limits [OK]
- Using 'gpu' instead of 'nvidia.com/gpu'
- Placing GPU under requests instead of limits
- Confusing CPU or memory keys with GPU
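As an illustration of the GPU answer above, the sketch below builds a container spec as a plain Python dictionary and prints it as JSON, which Kubernetes accepts in place of YAML. The helper name `gpu_container` is hypothetical; the container name and image are borrowed from the Job manifest in this practice set.

```python
import json


def gpu_container(name, image, gpus):
    """Return a container spec that asks for `gpus` NVIDIA GPUs via limits."""
    return {
        "name": name,
        "image": image,
        # GPUs go under limits, using the vendor-specific resource name.
        "resources": {"limits": {"nvidia.com/gpu": gpus}},
    }


container = gpu_container("trainer", "ml-image:latest", 2)
print(json.dumps(container, indent=2))
```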
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-train
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: ml-image:latest
        command: ["python", "train.py"]
      restartPolicy: Never
  backoffLimit: 3
Solution
Step 1: Understand Job behavior with backoffLimit
The backoffLimit sets how many retries happen on failure before the Job is marked as failed.
Step 2: Check restartPolicy and command
restartPolicy: Never means a failed container is not restarted inside its Pod; instead, the Job controller creates a new Pod for each retry.
Final Answer:
The Job runs the training once and retries up to 3 times on failure -> Option A
Quick Check:
Job with backoffLimit retries 3 times [OK]
- Thinking Job runs continuously like Deployment
- Assuming restartPolicy: Never causes immediate failure
- Confusing Job with Deployment resource
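The retry behavior described in this solution can be modeled with a short simulation. This is a simplified sketch, not the Job controller's actual logic: it ignores the delay between retries, and `run_job` and `attempt_succeeds` are hypothetical names introduced for the example.

```python
def run_job(attempt_succeeds, backoff_limit=3):
    """Simulate a Job with restartPolicy: Never.

    attempt_succeeds(n) returns True if the n-th Pod (1-based) succeeds.
    Returns a (status, pods_created) tuple.
    """
    failures = 0
    pods_created = 0
    while True:
        pods_created += 1  # each retry runs in a fresh Pod
        if attempt_succeeds(pods_created):
            return "Complete", pods_created
        failures += 1
        if failures > backoff_limit:
            # The initial attempt plus `backoff_limit` retries have all failed.
            return "Failed", pods_created


print(run_job(lambda n: False))   # ('Failed', 4)
print(run_job(lambda n: n == 3))  # ('Complete', 3)
```

With backoffLimit: 3, the model allows one initial attempt and three retries, so at most four Pods are created before the Job is marked as failed.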
Solution
Step 1: Analyze pod restart reasons
Pods restarting often means the container is crashing, commonly due to a bad image or command.
Step 2: Check other options relevance
A missing replicas field defaults to 1; Service exposure does not cause restarts; a missing ConfigMap causes configuration errors but not necessarily restarts.
Final Answer:
The container image is missing or incorrect -> Option D
Quick Check:
Pod restarts usually mean bad container image [OK]
- Assuming missing replicas causes restarts
- Confusing Service exposure with pod health
- Thinking ConfigMap absence always crashes pods
Solution
Step 1: Identify resource for long-running model serving
Deployment manages long-running pods and supports updates.
Step 2: Choose scaling feature for CPU-based autoscaling
Horizontal Pod Autoscaler (HPA) automatically adjusts pod count based on CPU usage.
Final Answer:
Deployment with Horizontal Pod Autoscaler (HPA) -> Option B
Quick Check:
Use Deployment + HPA for scalable model serving [OK]
- Using Job which is for batch tasks, not serving
- Choosing StatefulSet which is for stateful apps
- DaemonSet runs pods on all nodes, not for scaling
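To make the CPU-based scaling in this solution concrete, the sketch below applies the core formula from the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), and clamps the result to a replica range. It is a simplification: the real controller also applies a tolerance band and stabilization windows. The function name, the utilization numbers, and the replica bounds are assumptions for illustration.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct,
                     min_replicas=1, max_replicas=10):
    """Core HPA scaling rule, clamped to [min_replicas, max_replicas]."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


print(desired_replicas(4, 90, 60))   # 6: load above target, scale out
print(desired_replicas(4, 30, 60))   # 2: load below target, scale in
print(desired_replicas(4, 200, 50))  # 10: capped at max_replicas
```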
