Which Kubernetes feature ensures that ML training pods are scheduled on nodes with GPUs?
Think about how Kubernetes selects nodes based on hardware capabilities.
Node affinity lets the scheduler place pods only on nodes that carry specific labels, such as those indicating GPU availability. This is essential for ML workloads that require GPUs.
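As a sketch, a pod can require GPU nodes via node affinity on a label such as `accelerator=nvidia-gpu` (the label key/value and image name here are assumptions; the actual label depends on how your cluster's GPU nodes are labeled):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-train-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator        # hypothetical label applied to GPU nodes
            operator: In
            values:
            - nvidia-gpu
  containers:
  - name: trainer
    image: registry.example.com/ml-train:v1   # placeholder image
```

For the simple exact-match case, a `nodeSelector: {accelerator: nvidia-gpu}` entry achieves the same effect with less verbosity; node affinity is the more expressive superset.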
What is the output of kubectl describe pod ml-train-pod if the pod is pending due to insufficient GPU resources?
kubectl describe pod ml-train-pod
Look for messages about resource availability in the pod events.
The pod stays Pending because no node has enough allocatable GPUs. The scheduler records a FailedScheduling event on the pod whose message reports the insufficient GPU resource.
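The Events section of the `kubectl describe pod` output typically looks like the following sketch (node counts, age, and exact message wording vary by cluster and Kubernetes version):

```text
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
```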
Which YAML snippet correctly defines a PersistentVolumeClaim (PVC) for 50Gi of storage with ReadWriteOnce access mode suitable for ML training data?
Check the correct field for storage size and access mode for single node write access.
The correct PVC specifies the size under spec.resources.requests.storage and sets accessModes to ReadWriteOnce, which allows the volume to be mounted read-write by a single node.
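A minimal PVC matching that description might look like this sketch (the claim name is an assumption, and the storage class is left to the cluster default):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-training-data      # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce             # mountable read-write by a single node
  resources:
    requests:
      storage: 50Gi           # requested capacity
  # storageClassName: standard  # optional; cluster-specific
```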
What is the correct order of steps to deploy a distributed ML training job using Kubernetes?
Think about building the image before pushing and defining manifests before applying.
First build the Docker image, then push it to a container registry, next define Kubernetes manifests (e.g., a Job) that reference the pushed image, and finally apply the manifests to run the training job.
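The steps above might look like this on the command line (the image name, registry, and manifest filename are placeholders):

```shell
# 1. Build the training image from the local Dockerfile
docker build -t registry.example.com/ml-train:v1 .

# 2. Push it to the registry so cluster nodes can pull it
docker push registry.example.com/ml-train:v1

# 3. Write manifests (e.g., a Job) referencing that image, then apply them
kubectl apply -f ml-train-job.yaml
```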
An ML training pod repeatedly crashes with CrashLoopBackOff. Logs show Failed to initialize GPU device. What is the most likely cause?
Consider hardware and driver compatibility for GPU access.
GPU initialization failures usually indicate missing or misconfigured GPU drivers on the node, or an absent GPU device plugin, so the container crashes each time it tries to initialize the device, producing the CrashLoopBackOff.
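On NVIDIA nodes, GPU access requires working host drivers plus the NVIDIA device plugin (commonly deployed as a DaemonSet); the pod then requests GPUs through the extended resource, roughly like this sketch (pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-train-pod
spec:
  containers:
  - name: trainer
    image: registry.example.com/ml-train:v1   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # extended resource advertised by the NVIDIA device plugin
```

If the device plugin is not running or the drivers are broken, the `nvidia.com/gpu` resource is never advertised, so either scheduling fails or in-container GPU initialization errors out as in the logs above.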