Kubernetes for ML workloads in MLOps - Commands & Configuration

Introduction

Kubernetes runs machine learning workloads reliably by scheduling them onto available compute resources, retrying failed work, and scaling automatically. This removes the manual setup otherwise needed to run ML training or inference jobs across many machines. It is a good fit in these situations:

When you want to train a machine learning model on multiple servers to speed up the process.

When you need to deploy a trained ML model as a service that can handle many user requests.

When your ML workload requires automatic restarting if a training job fails.

When you want to run batch ML jobs that start and stop without manual intervention.

When you want to share GPU resources among different ML tasks efficiently.

Config File - ml-job.yaml

apiVersion: batch/v1
kind: Job
metadata:
  name: ml-training-job
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: tensorflow/tensorflow:2.12.0-gpu  # GPU-enabled tag; the plain 2.12.0 image is CPU-only
        command: ["python", "train.py"]
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
  backoffLimit: 4

This YAML file defines a Kubernetes Job to run a machine learning training task.

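apiVersion: batch/v1 and kind: Job declare this resource as a Job in the batch API group.

metadata.name names the job.

spec.template.spec.containers defines the container image and the command that runs the training script.

resources.limits reserves one NVIDIA GPU for the container. For GPUs, setting only the limit is enough because Kubernetes uses the same value as the request. The cluster must expose nvidia.com/gpu, typically through the NVIDIA device plugin.

restartPolicy: Never means a failed container is not restarted inside the same pod. Instead, the Job controller creates a new pod for each retry, up to backoffLimit (here 4) retries, before marking the job as failed.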
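When several training jobs differ only in name, image, or GPU count, it can help to generate the manifest instead of copying the YAML by hand. The following is a minimal sketch, not part of the original tutorial: build_training_job is a hypothetical helper, and the image tag passed to it is an assumption to replace with whatever image your job uses. It relies on the fact that kubectl accepts JSON manifests as well as YAML.

```python
import json


def build_training_job(name, image, command, gpus=1, backoff_limit=4):
    """Return a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    container = {
        "name": "trainer",
        "image": image,
        "command": list(command),
    }
    if gpus > 0:
        # GPUs are set under limits; Kubernetes uses the same value as the request.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": backoff_limit,
            "template": {
                "spec": {
                    "containers": [container],
                    "restartPolicy": "Never",
                }
            },
        },
    }


manifest = build_training_job(
    name="ml-training-job",
    image="tensorflow/tensorflow:2.12.0-gpu",  # assumed GPU-enabled image tag
    command=["python", "train.py"],
)
# Save this output as ml-job.json and apply it with: kubectl apply -f ml-job.json
print(json.dumps(manifest, indent=2))
```

The dict has the same fields as ml-job.yaml, so the explanations above apply to it unchanged.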

Commands

This command creates the ML training job in Kubernetes using the configuration file. It tells Kubernetes to start the job with the specified container and resources.

Terminal

kubectl apply -f ml-job.yaml

Expected Output

job.batch/ml-training-job created

This command lists the Jobs in the current namespace, whether running or completed, so you can check the status of the ML training job.

Terminal

kubectl get jobs

Expected Output

NAME              COMPLETIONS   DURATION   AGE
ml-training-job   0/1           10s        15s
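For scripts, the JSON form of the same information (kubectl get job ml-training-job -o json) is easier to check than the table. The sketch below is one possible way to classify a Job from its status block. The job_state helper and the sample object are illustrative assumptions, and the sample is hand-written rather than real cluster output.

```python
import json


def job_state(job):
    """Classify one Job object from `kubectl get job <name> -o json`."""
    status = job.get("status", {})
    # A finished Job carries a condition of type Complete or Failed set to "True".
    for cond in status.get("conditions", []):
        if cond.get("status") == "True" and cond.get("type") in ("Complete", "Failed"):
            return cond["type"]
    if status.get("active", 0) > 0:
        return "Running"
    return "Pending"


# Trimmed, hand-written sample of the JSON shape (not real cluster output).
sample = json.loads("""
{
  "metadata": {"name": "ml-training-job"},
  "status": {"active": 1, "failed": 0}
}
""")
print(job_state(sample))  # prints "Running"
```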

This command lists the pods created by the ML training job to see if the training container is running or completed.

Terminal

kubectl get pods -l job-name=ml-training-job

Expected Output

NAME                     READY   STATUS    RESTARTS   AGE
ml-training-job-abc123   1/1     Running   0          20s

-l job-name=ml-training-job - Filter pods by the job-name label, which Kubernetes adds to every pod a Job creates

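This command shows the logs of the ML training container so you can monitor training progress or debug errors. Replace ml-training-job-abc123 with the pod name returned by the previous command, since the suffix is generated by Kubernetes.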

Terminal

kubectl logs ml-training-job-abc123

Expected Output

Epoch 1/10 loss: 0.45 - accuracy: 0.85
Epoch 2/10 loss: 0.30 - accuracy: 0.90
Training complete.
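If you want to track these numbers rather than just read them, the log text can be parsed. This is a rough sketch that assumes train.py prints lines in exactly the format shown above. Real training scripts often log differently, so the pattern and the parse_epochs name are illustrative only.

```python
import re

# Matches lines such as "Epoch 1/10 loss: 0.45 - accuracy: 0.85".
EPOCH_RE = re.compile(
    r"Epoch (\d+)/(\d+)\s+loss: (\d+(?:\.\d+)?) - accuracy: (\d+(?:\.\d+)?)"
)


def parse_epochs(log_text):
    """Extract (epoch, total_epochs, loss, accuracy) tuples from log text."""
    return [
        (int(m.group(1)), int(m.group(2)), float(m.group(3)), float(m.group(4)))
        for m in EPOCH_RE.finditer(log_text)
    ]


log_text = """Epoch 1/10 loss: 0.45 - accuracy: 0.85
Epoch 2/10 loss: 0.30 - accuracy: 0.90
Training complete."""

for epoch, total, loss, acc in parse_epochs(log_text):
    print(f"epoch {epoch}/{total}: loss={loss:.2f} accuracy={acc:.2f}")
```

In practice log_text would come from the output of kubectl logs for the job's pod.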

Key Concept

If you remember nothing else from this pattern, remember: Kubernetes Jobs let you run ML training tasks reliably with automatic retries and resource management.

Common Mistakes

Not specifying restartPolicy: Never in the job pod spec

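A Job's pod template must set restartPolicy to Never or OnFailure; the default value, Always, is rejected for Jobs. With OnFailure, a failed container restarts in place inside the same pod, and that pod is terminated once the backoff limit is reached, which makes failures harder to debug.

Prefer restartPolicy: Never for training jobs so that each retry runs in a fresh pod, failed pods remain available for inspection, and retries are handled at the job level through backoffLimit.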
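Job-level retries are not immediate. The Kubernetes documentation describes an exponential back-off between retries (10s, 20s, 40s, and so on) capped at six minutes. The sketch below only illustrates that arithmetic for a given backoffLimit. The retry_delays function is hypothetical, and actual timing in a cluster is approximate.

```python
def retry_delays(backoff_limit, base_seconds=10, cap_seconds=360):
    """Approximate wait before each retry pod under exponential back-off."""
    return [min(base_seconds * 2 ** i, cap_seconds) for i in range(backoff_limit)]


delays = retry_delays(4)
print(delays)           # [10, 20, 40, 80]
print(len(delays) + 1)  # 5: one initial pod plus up to 4 retries
```

With backoffLimit: 4, a job that keeps failing therefore spends roughly two and a half minutes waiting between attempts before it is marked as failed.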

Forgetting to request GPU resources in the container spec

The training job runs without GPU acceleration, so training is slower, or it fails outright if the script requires a GPU.

Add resource limits like nvidia.com/gpu: 1 to request GPU access for the container.
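A small pre-flight check can catch this mistake before the job is submitted. The following sketch assumes the manifest has already been loaded into a Python dict (for example from a JSON version of ml-job.yaml). The containers_without_gpu helper and the sidecar container are hypothetical.

```python
GPU_KEY = "nvidia.com/gpu"


def containers_without_gpu(manifest):
    """Return names of containers in a Job manifest that set no GPU limit."""
    pod_spec = manifest["spec"]["template"]["spec"]
    missing = []
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        # The limit may be written as a number or as a string such as "1".
        if int(limits.get(GPU_KEY, 0)) < 1:
            missing.append(container["name"])
    return missing


job = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "trainer", "resources": {"limits": {GPU_KEY: 1}}},
        {"name": "sidecar"},  # hypothetical second container with no GPU limit
    ]}}}
}
print(containers_without_gpu(job))  # ['sidecar']
```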

Not checking pod logs to monitor training progress

You miss important feedback on training status or errors, making debugging harder.

Use kubectl logs on the job pod to see the training script's output, and add -f to stream it in real time.

Summary

Create a Kubernetes Job YAML file to define the ML training task with container image, command, and resource requests.

Use kubectl apply to start the job and kubectl get jobs to check its status.

List pods created by the job and view their logs to monitor training progress and troubleshoot.

Practice

1. What is the primary Kubernetes resource used to run a one-time ML training task?

easy

A. Job

B. Deployment

C. Service

D. ConfigMap

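Solution

Step 1: Understand Kubernetes resource types

A Job runs pods until they complete. A Deployment keeps long-running replicas available. A Service exposes pods over the network. A ConfigMap stores configuration data.

Step 2: Match resource to ML training task

A one-time training task runs to completion and then stops, which is the behavior a Job provides.

Final Answer: A. Job

Quick Check: A one-time task that runs to completion is a Job.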