Kubernetes for ML workloads in MLOps - Interactive Code Practice

Practice - 5 Tasks
Answer the questions below
1. Fill in the blank
easy

Complete the code to create a Kubernetes pod that runs a machine learning container.

apiVersion: v1
kind: Pod
metadata:
  name: ml-pod
spec:
  containers:
  - name: ml-container
    image: [1]
Options:
A. tensorflow/tensorflow:latest
B. ubuntu:latest
C. nginx:alpine
D. mysql:5.7
Common Mistakes:
  • Using a web server image like nginx instead of an ML image.
  • Using a database image like mysql, which is unrelated to ML workloads.
2. Fill in the blank
medium

Complete the code to specify resource limits for the ML container in the pod.

spec:
  containers:
  - name: ml-container
    resources:
      limits:
        cpu: [1]
Options:
A. 2Gi
B. 2
C. 2MB
D. 500m
Common Mistakes:
  • Using memory units like '2Gi' for CPU limits.
  • Assuming a plain number is invalid for CPU: '2' means two full cores and '500m' means half a core, so both are valid CPU quantities.
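As a rough sketch of how CPU and memory quantities are written side by side, the fragment below sets both requests and limits on the container from this task. The specific values and the image name are illustrative assumptions, not part of the exercise.

```yaml
spec:
  containers:
  - name: ml-container
    image: ml-image:latest   # placeholder image name (assumption)
    resources:
      requests:
        cpu: 500m            # half a core, written in millicores
        memory: 1Gi          # memory uses binary suffixes such as Mi and Gi
      limits:
        cpu: "2"             # two full cores; a plain number is a valid CPU quantity
        memory: 2Gi
```

CPU is measured in cores (or millicores with the `m` suffix), while memory is measured in bytes with suffixes such as `Mi` and `Gi`, so the two kinds of units are not interchangeable.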
3. Fill in the blank
hard

Complete the YAML to mount a volume for ML data inside the container.

spec:
  containers:
  - name: ml-container
    volumeMounts:
    - name: data-volume
      mountPath: [1]
  volumes:
  - name: data-volume
    persistentVolumeClaim:
      claimName: ml-data-pvc
Options:
A. /var/lib/data
B. /data-volume
C. /mnt/data
D. /etc/data
Common Mistakes:
  • Using a mount path that is not a directory, or one that shadows a system directory such as /etc.
  • Using the volume name as the mount path: mountPath is a directory path inside the container and is independent of the volume's name.
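The pod above refers to a claim named ml-data-pvc, which must already exist in the same namespace. A minimal sketch of such a claim might look like the following; the access mode and the 10Gi size are assumptions chosen for illustration, and the example relies on the cluster having a default StorageClass.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-data-pvc          # must match claimName in the pod spec
spec:
  accessModes:
  - ReadWriteOnce            # assumption: mounted read-write by a single node
  resources:
    requests:
      storage: 10Gi          # illustrative size, not taken from the exercise
```

If the claim is missing or cannot be bound, the pod stays in Pending until it is.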
4. Fill in the blanks
hard

Fill both blanks to create a Kubernetes Job that runs an ML training script once.

apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training-job
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: [1]
        command: ["python", [2]]
      restartPolicy: Never
Options:
A. ml-training-image:latest
B. "train.py"
C. app.py
D. tensorflow/tensorflow:latest
Common Mistakes:
  • Using a generic TensorFlow image that does not contain the training script.
  • Using the wrong script name in the command.
5. Fill in the blanks
hard

Fill all three blanks to define a Kubernetes Deployment for an ML model server with 3 replicas and an environment variable.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-server
spec:
  replicas: [1]
  selector:
    matchLabels:
      app: ml-server
  template:
    metadata:
      labels:
        app: ml-server
    spec:
      containers:
      - name: model-server
        image: [2]
        env:
        - name: MODEL_NAME
          value: [3]
Options:
A. 3
B. ml-model-server:latest
C. "image-classifier"
D. 1
Common Mistakes:
  • Setting replicas to 1 instead of 3.
  • Leaving the environment variable value unquoted: env values must be strings, so a bare value such as 3 or true is parsed as a non-string and rejected.
  • Using the wrong image name.

Practice

1. What is the primary Kubernetes resource used to run a one-time ML training task?
easy
A. Job
B. Deployment
C. Service
D. ConfigMap

Solution

  1. Step 1: Understand Kubernetes resource types

    Jobs are designed to run tasks that complete once, like ML training.
  2. Step 2: Match resource to ML training task

    Since training is a one-time batch task, Job is the correct resource.
  3. Final Answer:

    Job -> Option A
  4. Quick Check:

    One-time ML training = Job [OK]
Hint: Use Job for one-time tasks like training [OK]
Common Mistakes:
  • Choosing Deployment which is for long-running services
  • Confusing Service with workload resource
  • Using ConfigMap which stores config data only
2. Which of the following is the correct YAML snippet to request 2 GPUs in a Kubernetes pod spec?
easy
A. resources: requests: cpu: 2
B. resources: limits: memory: 2Gi
C. resources: limits: nvidia.com/gpu: 2
D. resources: requests: gpu: 2

Solution

  1. Step 1: Identify GPU resource naming in Kubernetes

    GPUs are requested using the vendor-specific resource name like nvidia.com/gpu.
  2. Step 2: Check correct YAML structure for limits

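    GPUs are specified under limits with the vendor-specific key. Kubernetes then uses the limit as the request; if requests is also given, it must equal limits, and a GPU request without a limit is not allowed.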
  3. Final Answer:

    resources: limits: nvidia.com/gpu: 2 -> Option C
  4. Quick Check:

    GPU request uses nvidia.com/gpu under limits [OK]
Hint: GPU requests use 'limits' with 'nvidia.com/gpu' key [OK]
Common Mistakes:
  • Using 'gpu' instead of 'nvidia.com/gpu'
  • Placing GPU under requests instead of limits
  • Confusing CPU or memory keys with GPU
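The answer options flatten the YAML onto one line, so here is a sketch of the same GPU limit written out with proper nesting inside a pod spec. The pod name and image are hypothetical, and the example assumes the NVIDIA device plugin is installed on the cluster so that nodes advertise the nvidia.com/gpu resource.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod       # hypothetical name
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: ml-image:latest     # placeholder image (assumption)
    resources:
      limits:
        nvidia.com/gpu: 2      # whole GPUs only; fractional values are not accepted
```

Without a node that has two free GPUs, a pod like this would remain Pending rather than start with fewer.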
3. Given this Kubernetes Job YAML snippet, what will happen when applied?
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-train
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: ml-image:latest
        command: ["python", "train.py"]
      restartPolicy: Never
  backoffLimit: 3
medium
A. The Job runs the training once and retries up to 3 times on failure
B. The Job runs continuously without stopping
C. The Job will fail immediately due to missing restartPolicy
D. The Job creates a Deployment instead of a batch task

Solution

  1. Step 1: Understand Job behavior with backoffLimit

    The backoffLimit sets how many retries are allowed after a failure before the Job is marked as failed.
  2. Step 2: Check restartPolicy and command

    restartPolicy: Never means a failed container is not restarted in place; instead, the Job controller creates a new pod for each retry.
  3. Final Answer:

    The Job runs the training once and retries up to 3 times on failure -> Option A
  4. Quick Check:

    Job with backoffLimit retries 3 times [OK]
Hint: backoffLimit controls retry count for Job failures [OK]
Common Mistakes:
  • Thinking Job runs continuously like Deployment
  • Assuming restartPolicy: Never causes immediate failure
  • Confusing Job with Deployment resource
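For contrast, the sketch below shows the same Job with restartPolicy: OnFailure, the other value a Job's pod template accepts. In this variant a failed container is restarted inside the same pod rather than replaced by a new pod. The Job name is hypothetical; the image and command are copied from the question.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: ml-train-onfailure     # hypothetical name
spec:
  backoffLimit: 3              # still caps the number of retries
  template:
    spec:
      containers:
      - name: trainer
        image: ml-image:latest
        command: ["python", "train.py"]
      restartPolicy: OnFailure # restart the container in place on failure
```

Never tends to be easier to debug, because each failed attempt leaves its own pod and logs behind, while OnFailure keeps the retries inside one pod.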
4. You deployed an ML model with a Deployment but the pods keep restarting. Which is the most likely cause?
medium
A. The ConfigMap is not mounted
B. The Deployment spec is missing replicas field
C. The Service is not exposing the Deployment
D. The container image is missing or incorrect

Solution

  1. Step 1: Analyze pod restart reasons

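    Repeated restarts mean the container starts and then exits, commonly because the image is wrong or its command fails. Strictly, an image that cannot be pulled at all leaves the pod in ImagePullBackOff without ever starting, so "incorrect" is the part of the answer that explains the restarts.
  2. Step 2: Check other options relevance

    A missing replicas field defaults to 1; Service exposure does not affect pod restarts; an unmounted ConfigMap causes configuration errors but does not always crash the container.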
  3. Final Answer:

    The container image is missing or incorrect -> Option D
  4. Quick Check:

    Pod restarts usually mean a bad container image or command [OK]
Hint: Pod restarts often mean container image or command error [OK]
Common Mistakes:
  • Assuming missing replicas causes restarts
  • Confusing Service exposure with pod health
  • Thinking ConfigMap absence always crashes pods
5. You want to deploy an ML model serving system that automatically scales based on CPU usage. Which Kubernetes resource and feature combination is best?
hard
A. DaemonSet to run one pod per node
B. Deployment with Horizontal Pod Autoscaler (HPA)
C. StatefulSet with persistent volumes
D. Job with backoffLimit set to 5

Solution

  1. Step 1: Identify resource for long-running model serving

    Deployment manages long-running pods and supports updates.
  2. Step 2: Choose scaling feature for CPU-based autoscaling

    Horizontal Pod Autoscaler (HPA) automatically adjusts pod count based on CPU usage.
  3. Final Answer:

    Deployment with Horizontal Pod Autoscaler (HPA) -> Option B
  4. Quick Check:

    Use Deployment + HPA for scalable model serving [OK]
Hint: Use Deployment + HPA for auto-scaling model serving [OK]
Common Mistakes:
  • Using Job which is for batch tasks, not serving
  • Choosing StatefulSet which is for stateful apps
  • DaemonSet runs pods on all nodes, not for scaling
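A minimal sketch of the HPA half of this answer is shown below, targeting the ml-model-server Deployment from the earlier task. The HPA name, the replica bounds, and the 70% CPU target are illustrative assumptions rather than recommended values.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-server-hpa    # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-server      # the Deployment to scale
  minReplicas: 3               # assumption: never drop below the original 3 replicas
  maxReplicas: 10              # assumption: upper bound chosen for illustration
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70 # scale out when average CPU exceeds 70% of requests
```

CPU utilization is computed relative to each container's CPU request, so the Deployment's containers need resources.requests.cpu set, and the cluster needs a metrics source such as metrics-server for the HPA to act.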