
Kubernetes for ML workloads in MLOps - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is Kubernetes in the context of ML workloads?
Kubernetes is a container orchestration system that runs and manages machine learning tasks across a cluster of machines. It keeps ML programs running by restarting failed containers, and it can scale workloads up or down as demand changes.
beginner
Why use Kubernetes for machine learning model training?
Kubernetes helps by automatically managing resources, running training jobs in containers, and scaling up or down based on need. This saves time and avoids manual setup.
beginner
What is a Pod in Kubernetes?
A Pod is the smallest deployable unit in Kubernetes. It holds one or more containers that run ML code or services together on the same node and share networking and storage volumes.
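To make the idea of a Pod concrete, here is a minimal sketch of a single-container Pod manifest. The Pod name, the container name, and the `serve.py` script are hypothetical; the image name is borrowed from the Job example later in this sheet.

```yaml
# Minimal Pod sketch (names and serve.py are illustrative assumptions)
apiVersion: v1
kind: Pod
metadata:
  name: ml-inference            # hypothetical Pod name
spec:
  containers:
  - name: model-server          # hypothetical container name
    image: ml-image:latest      # image name reused from the Job example
    command: ["python", "serve.py"]   # hypothetical entry point
```

In practice Pods are rarely created directly; a Job or Deployment usually creates them from a template like this one.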
intermediate
How does Kubernetes help with scaling ML workloads?
Kubernetes can add or remove Pods automatically, for example with a Horizontal Pod Autoscaler that watches metrics such as CPU usage. Your ML tasks get more capacity when load is high and release resources when it is low.
intermediate
What role do Persistent Volumes play in ML workloads on Kubernetes?
Persistent Volumes store data like training datasets or model files outside of Pods, so data stays safe even if Pods stop or restart.
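As an illustration, a Pod usually reaches a Persistent Volume through a PersistentVolumeClaim. The sketch below assumes the cluster has a default StorageClass that can provision the volume; the claim name, the size, and the `/data` mount path are hypothetical.

```yaml
# Claim storage for training data (name and size are illustrative assumptions)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: training-data
spec:
  accessModes:
  - ReadWriteOnce               # mounted read-write by a single node
  resources:
    requests:
      storage: 50Gi
---
# Pod that mounts the claim; data under /data outlives the Pod
apiVersion: v1
kind: Pod
metadata:
  name: trainer
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: ml-image:latest
    command: ["python", "train.py"]
    volumeMounts:
    - name: data
      mountPath: /data          # hypothetical mount path
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: training-data
```

If the Pod is deleted or rescheduled, the claim and the data behind it remain and can be mounted by the next Pod.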
What does Kubernetes use to run ML code in isolated environments?
A. Physical Servers
B. Virtual Machines
C. Containers
D. Databases
Which Kubernetes object is the smallest unit that runs containers?
A. Node
B. Pod
C. Service
D. Deployment
How does Kubernetes help when ML workloads need more computing power?
A. It automatically scales Pods up or down
B. It asks the user to add servers manually
C. It pauses the workload
D. It deletes old data
What is the purpose of Persistent Volumes in Kubernetes for ML?
A. To store data safely outside Pods
B. To monitor CPU usage
C. To create network connections
D. To run ML code faster
Which of these is NOT a benefit of using Kubernetes for ML workloads?
A. Isolation of ML tasks in containers
B. Easy scaling of workloads
C. Automatic resource management
D. Manual hardware setup required
Explain how Kubernetes manages machine learning workloads from running code to scaling resources.
Think about how Kubernetes runs and adjusts ML tasks automatically.
Describe the role of storage in Kubernetes for ML workloads and why it is important.
Consider what happens to data when ML tasks stop or restart.

      Practice

      1. What is the primary Kubernetes resource used to run a one-time ML training task?
      easy
      A. Job
      B. Deployment
      C. Service
      D. ConfigMap

      Solution

      1. Step 1: Understand Kubernetes resource types

        Jobs are designed for tasks that run to completion and then stop, like ML training.
      2. Step 2: Match resource to ML training task

        Since training is a one-time batch task, Job is the correct resource.
      3. Final Answer:

        Job -> Option A
      4. Quick Check:

        One-time ML training = Job [OK]
      Hint: Use Job for one-time tasks like training [OK]
      Common Mistakes:
      • Choosing Deployment which is for long-running services
      • Confusing Service with workload resource
      • Using ConfigMap which stores config data only
      2. Which of the following is the correct YAML snippet to request 2 GPUs in a Kubernetes pod spec?
      easy
      A. resources: {requests: {cpu: 2}}
      B. resources: {limits: {memory: 2Gi}}
      C. resources: {limits: {nvidia.com/gpu: 2}}
      D. resources: {requests: {gpu: 2}}

      Solution

      1. Step 1: Identify GPU resource naming in Kubernetes

        GPUs are requested using a vendor-specific extended resource name such as nvidia.com/gpu.
      2. Step 2: Check correct YAML structure for limits

        GPUs are set under limits. If requests is omitted, Kubernetes uses the limit as the request; if both are given, they must be equal.
      3. Final Answer:

        resources: {limits: {nvidia.com/gpu: 2}} -> Option C
      4. Quick Check:

        GPU request uses nvidia.com/gpu under limits [OK]
      Hint: GPU requests use 'limits' with 'nvidia.com/gpu' key [OK]
      Common Mistakes:
      • Using 'gpu' instead of 'nvidia.com/gpu'
      • Placing GPU under requests instead of limits
      • Confusing CPU or memory keys with GPU
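For context, the GPU limit from the answer above sits inside a container spec. The following is a minimal sketch, assuming the cluster's nodes have NVIDIA GPUs and the NVIDIA device plugin installed so that `nvidia.com/gpu` is advertised as a resource; the Pod and container names are hypothetical.

```yaml
# Pod requesting 2 GPUs (assumes the NVIDIA device plugin is running on the nodes)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-trainer             # hypothetical Pod name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: ml-image:latest
    command: ["python", "train.py"]
    resources:
      limits:
        nvidia.com/gpu: 2       # whole GPUs only; the request defaults to the limit
```

Without a device plugin that exposes this resource, the Pod would stay Pending because no node can satisfy the limit.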
      3. Given this Kubernetes Job YAML snippet, what will happen when applied?
      apiVersion: batch/v1
      kind: Job
      metadata:
        name: ml-train
      spec:
        template:
          spec:
            containers:
            - name: trainer
              image: ml-image:latest
              command: ["python", "train.py"]
            restartPolicy: Never
        backoffLimit: 3
      
      medium
      A. The Job runs the training once and retries up to 3 times on failure
      B. The Job runs continuously without stopping
      C. The Job will fail immediately due to missing restartPolicy
      D. The Job creates a Deployment instead of a batch task

      Solution

      1. Step 1: Understand Job behavior with backoffLimit

        The backoffLimit sets how many times a failed Pod is retried before the Job is marked as failed.
      2. Step 2: Check restartPolicy and command

        With restartPolicy: Never, a failed container is not restarted in place; the Job controller creates a new Pod for each retry.
      3. Final Answer:

        The Job runs the training once and retries up to 3 times on failure -> Option A
      4. Quick Check:

        Job with backoffLimit retries 3 times [OK]
      Hint: backoffLimit controls retry count for Job failures [OK]
      Common Mistakes:
      • Thinking Job runs continuously like Deployment
      • Assuming restartPolicy: Never causes immediate failure
      • Confusing Job with Deployment resource
      4. You deployed an ML model with a Deployment but the pods keep restarting. Which is the most likely cause?
      medium
      A. The ConfigMap is not mounted
      B. The Deployment spec is missing replicas field
      C. The Service is not exposing the Deployment
      D. The container image or its startup command is incorrect

      Solution

      1. Step 1: Analyze pod restart reasons

        Repeated restarts (CrashLoopBackOff) mean the container starts and then crashes, commonly because the image contains the wrong code or the command fails. An image that cannot be pulled at all shows ImagePullBackOff instead, with no restarts.
      2. Step 2: Check other options relevance

        A missing replicas field defaults to 1, and Service exposure does not affect pod health. An unmounted ConfigMap can cause configuration errors, but it does not always lead to restarts.
      3. Final Answer:

        The container image or its startup command is incorrect -> Option D
      4. Quick Check:

        Pod restarts usually mean the container crashes on startup [OK]
      Hint: Pod restarts often mean container image or command error [OK]
      Common Mistakes:
      • Assuming missing replicas causes restarts
      • Confusing Service exposure with pod health
      • Thinking ConfigMap absence always crashes pods
      5. You want to deploy an ML model serving system that automatically scales based on CPU usage. Which Kubernetes resource and feature combination is best?
      hard
      A. DaemonSet to run one pod per node
      B. Deployment with Horizontal Pod Autoscaler (HPA)
      C. StatefulSet with persistent volumes
      D. Job with backoffLimit set to 5

      Solution

      1. Step 1: Identify resource for long-running model serving

        A Deployment keeps a set of long-running pods available and supports rolling updates.
      2. Step 2: Choose scaling feature for CPU-based autoscaling

        Horizontal Pod Autoscaler (HPA) automatically adjusts pod count based on CPU usage.
      3. Final Answer:

        Deployment with Horizontal Pod Autoscaler (HPA) -> Option B
      4. Quick Check:

        Use Deployment + HPA for scalable model serving [OK]
      Hint: Use Deployment + HPA for auto-scaling model serving [OK]
      Common Mistakes:
      • Using a Job, which is for batch tasks, not serving
      • Choosing a StatefulSet, which is for stateful apps
      • Choosing a DaemonSet, which runs one pod per node and does not scale with load
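The Deployment plus HPA combination from question 5 could look roughly like the sketch below. It assumes a metrics source such as metrics-server is installed in the cluster. The names, replica bounds, CPU request, and 70% target are illustrative values, not recommendations.

```yaml
# Long-running model server (names and numbers are illustrative assumptions)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model-server
        image: ml-image:latest
        resources:
          requests:
            cpu: 500m           # CPU utilization is measured relative to this request
---
# Autoscaler that adjusts the Deployment's replica count based on CPU usage
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70  # add pods when average CPU exceeds ~70% of requests
```

The CPU request on the container matters here: a utilization target is a percentage of the requested CPU, so the autoscaler cannot compute it for containers that set no request.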