Kubernetes for ML Workloads
📖 Scenario: You are a data scientist who wants to run a machine learning training job on Kubernetes. You will create a simple Kubernetes Pod configuration to run a Python script that trains a model. This project will guide you step-by-step to create the YAML configuration, add resource limits, and finally deploy and check the Pod status.
🎯 Goal: Build a Kubernetes Pod YAML file to run a machine learning training script, add resource limits, and deploy it to see the Pod running.
📋 What You'll Learn
Create a basic Kubernetes Pod YAML file named ml-training-pod.yaml with a container running Python
Add resource limits for CPU and memory to the container
Deploy the Pod using kubectl apply
Check the Pod status using kubectl get pods
💡 Why This Matters
🌍 Real World
Data scientists and ML engineers use Kubernetes to run training jobs reliably and scale them easily in production environments.
💼 Career
Knowing how to configure and deploy ML workloads on Kubernetes is a key skill for MLOps engineers and DevOps professionals working with AI projects.
1
Create the basic Pod YAML
Create a file named ml-training-pod.yaml with a Kubernetes Pod configuration. The Pod should be named ml-training-pod and run a container named ml-container using the image python:3.12-slim. The container should run the command python with arguments -c and print('Training started').
Hint
Remember to use YAML indentation carefully. The container spec goes under spec.containers.
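One way the file described in this step could look is sketched below. The restartPolicy: Never line is an assumption that is not part of the step text; it is included so that a container which prints once and exits is reported as Completed instead of being restarted.

```yaml
# ml-training-pod.yaml: a minimal sketch of the Pod described in step 1
apiVersion: v1
kind: Pod
metadata:
  name: ml-training-pod
spec:
  # Assumption (not in the step text): do not restart the container
  # after the one-shot command exits.
  restartPolicy: Never
  containers:
    - name: ml-container
      image: python:3.12-slim
      # The command and its arguments are kept separate, as the step describes.
      command: ["python"]
      args: ["-c", "print('Training started')"]
```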
2
Add resource limits to the container
In the existing ml-training-pod.yaml file, add resource limits to the container ml-container. Set the CPU limit to 500m and memory limit to 256Mi under resources.limits.
Hint
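Resource limits go under resources.limits inside the container entry, indented to match the container's other fields. Quoting values such as "500m" is optional, but it keeps them unambiguously strings.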
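As a rough sketch, the spec section of ml-training-pod.yaml might look like this once the limits are added. Only the resources block is new; the other fields repeat the values given in step 1.

```yaml
spec:
  containers:
    - name: ml-container
      image: python:3.12-slim
      command: ["python"]
      args: ["-c", "print('Training started')"]
      resources:
        limits:
          cpu: "500m"      # half of one CPU core
          memory: "256Mi"  # 256 mebibytes
```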
3
Deploy the Pod to Kubernetes
Use the command kubectl apply -f ml-training-pod.yaml to deploy the Pod to your Kubernetes cluster.
Hint
This command tells Kubernetes to create or update resources defined in the YAML file.
4
Check the Pod status
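Use the command kubectl get pods to check the status of the Pod named ml-training-pod. The output should show the Pod with status Running or Completed. Because the container's command exits almost immediately, a Pod left at the default restartPolicy of Always may later show CrashLoopBackOff as Kubernetes keeps restarting it.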
Hint
Look for the Pod name ml-training-pod in the list and check its STATUS column.
Practice
1. What is the primary Kubernetes resource used to run a one-time ML training task?
easy
A. Job
B. Deployment
C. Service
D. ConfigMap
Solution
Step 1: Understand Kubernetes resource types
Jobs are designed to run tasks that complete once, like ML training.
Step 2: Match resource to ML training task
Since training is a one-time batch task, Job is the correct resource.
Final Answer:
Job -> Option A
Quick Check:
One-time ML training = Job [OK]
Hint: Use Job for one-time tasks like training [OK]
Common Mistakes:
Choosing Deployment which is for long-running services
Confusing Service with workload resource
Using ConfigMap which stores config data only
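To illustrate the Job resource, here is a minimal sketch that wraps a one-shot training command in a Job. The name ml-training-job is hypothetical, and the image and command are borrowed from the Pod exercise for illustration only.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training-job   # hypothetical name
spec:
  template:
    spec:
      # A Job's Pod template must use Never or OnFailure.
      restartPolicy: Never
      containers:
        - name: ml-container
          image: python:3.12-slim
          command: ["python"]
          args: ["-c", "print('Training started')"]
```

Unlike a Deployment, the Job is considered finished once its Pod exits successfully.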
2. Which of the following is the correct YAML snippet to request 2 GPUs in a Kubernetes pod spec?
easy
A. resources:
     requests:
       cpu: 2
B. resources:
     limits:
       memory: 2Gi
C. resources:
     limits:
       nvidia.com/gpu: 2
D. resources:
     requests:
       gpu: 2
Solution
Step 1: Identify GPU resource naming in Kubernetes
GPUs are requested using the vendor-specific resource name like nvidia.com/gpu.
Step 2: Check correct YAML structure for limits
GPUs are specified under limits with the vendor-specific key. If requests is also given for a GPU, it must equal the limit.
Final Answer:
resources:
  limits:
    nvidia.com/gpu: 2
-> Option C
Quick Check:
GPU request uses nvidia.com/gpu under limits [OK]
Hint: GPU requests use 'limits' with 'nvidia.com/gpu' key [OK]
Common Mistakes:
Using 'gpu' instead of 'nvidia.com/gpu'
Placing GPU under requests instead of limits
Confusing CPU or memory keys with GPU
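For context, the fragment from Option C could sit inside a full Pod spec roughly as follows. This sketch assumes the cluster has the NVIDIA device plugin installed so that nvidia.com/gpu is a schedulable resource. The Pod name, container name, and image are hypothetical placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod   # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: trainer        # hypothetical name
      image: registry.example.com/ml/trainer:latest  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 2  # whole GPUs only; fractions are not allowed
```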
3. Given this Kubernetes Job YAML snippet, what will happen when applied?
A. The Job runs the training once and retries up to 3 times on failure
B. The Job runs continuously without stopping
C. The Job will fail immediately due to missing restartPolicy
D. The Job creates a Deployment instead of a batch task
Solution
Step 1: Understand Job behavior with backoffLimit
The backoffLimit sets how many retries happen on failure before Job stops.
Step 2: Check restartPolicy and command
restartPolicy: Never means a failed container is not restarted in place. Instead, the Job controller creates a new Pod for each retry.
Final Answer:
The Job runs the training once and retries up to 3 times on failure -> Option A
Quick Check:
Job with backoffLimit retries 3 times [OK]
Hint: backoffLimit controls retry count for Job failures [OK]
Common Mistakes:
Thinking Job runs continuously like Deployment
Assuming restartPolicy: Never causes immediate failure
Confusing Job with Deployment resource
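The Job snippet this question refers to is not shown above. A sketch consistent with the solution, assuming backoffLimit: 3 and restartPolicy: Never, might look like the following. The Job name, container name, image, and command are hypothetical stand-ins, not the original snippet.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model        # hypothetical name
spec:
  backoffLimit: 3          # up to 3 retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never # failed Pods are replaced, not restarted in place
      containers:
        - name: trainer    # hypothetical name
          image: python:3.12-slim
          command: ["python", "-c", "print('Training model')"]
```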
4. You deployed an ML model with a Deployment but the pods keep restarting. Which is the most likely cause?
medium
A. The ConfigMap is not mounted
B. The Deployment spec is missing replicas field
C. The Service is not exposing the Deployment
D. The container image is missing or incorrect
Solution
Step 1: Analyze pod restart reasons
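Pods that keep restarting usually mean the container is crashing, commonly because the image is wrong (for example, a tag whose entrypoint fails) or the command is wrong. An image that cannot be pulled at all shows up as ImagePullBackOff, not as restarts.
Step 2: Check the relevance of the other options
A missing replicas field defaults to 1. Service exposure does not affect container restarts. A missing ConfigMap causes configuration errors, but not necessarily restarts.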
Final Answer:
The container image is missing or incorrect -> Option D
Quick Check:
Pod restarts usually mean bad container image [OK]
Hint: Pod restarts often mean container image or command error [OK]
Common Mistakes:
Assuming missing replicas causes restarts
Confusing Service exposure with pod health
Thinking ConfigMap absence always crashes pods
5. You want to deploy an ML model serving system that automatically scales based on CPU usage. Which Kubernetes resource and feature combination is best?
hard
A. DaemonSet to run one pod per node
B. Deployment with Horizontal Pod Autoscaler (HPA)
C. StatefulSet with persistent volumes
D. Job with backoffLimit set to 5
Solution
Step 1: Identify resource for long-running model serving
Deployment manages long-running pods and supports updates.
Step 2: Choose scaling feature for CPU-based autoscaling
Horizontal Pod Autoscaler (HPA) automatically adjusts pod count based on CPU usage.
Final Answer:
Deployment with Horizontal Pod Autoscaler (HPA) -> Option B
Quick Check:
Use Deployment + HPA for scalable model serving [OK]
Hint: Use Deployment + HPA for auto-scaling model serving [OK]
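The Deployment plus HPA combination can be sketched as two manifests in one file. All names, the image, the replica bounds, and the 70% CPU target below are illustrative assumptions. The sketch also assumes the cluster runs a metrics source such as metrics-server, which the HPA needs in order to read CPU usage.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server       # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
        - name: model-server
          image: registry.example.com/ml/model-server:latest  # placeholder image
          resources:
            requests:
              cpu: "250m"  # utilization is measured relative to this request
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa   # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 1
  maxReplicas: 5           # illustrative upper bound
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # illustrative target
```

The CPU request on the container matters here: without it, the HPA has no baseline for computing a utilization percentage.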