MLOps · DevOps · ~20 mins

Distributed training basics in MLOps - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual · intermediate
Understanding Parameter Server Role
In a distributed training setup using a parameter server architecture, what is the primary role of the parameter server?
A. It manages the hardware resources, such as GPUs and CPUs.
B. It performs data preprocessing before sending batches to workers.
C. It stores and updates the global model parameters shared among the workers.
D. It handles user authentication and access control for the training cluster.
💡 Hint
Think about where the model weights are kept and updated during training.
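To make the pattern behind this question concrete, here is a minimal single-process sketch of the parameter-server idea: the server object owns the global weights, and each worker step pulls them, computes a gradient on its own data shard, and pushes the update back. All names and the toy least-squares objective are illustrative, not from any particular framework.

```python
# Minimal sketch of the parameter-server pattern (illustrative only):
# the server owns the global weights; workers pull them, compute a
# gradient on their own data shard, and push updates back.
import numpy as np

class ParameterServer:
    """Holds and updates the shared model parameters."""
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # Workers fetch the current global parameters.
        return self.weights.copy()

    def push(self, gradient):
        # Apply a worker's gradient to the global parameters.
        self.weights -= self.lr * gradient

def worker_step(server, data_shard, target):
    # One worker iteration on a toy least-squares objective.
    w = server.pull()
    pred = data_shard @ w
    grad = data_shard.T @ (pred - target) / len(target)
    server.push(grad)

ps = ParameterServer(dim=2)
rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 2)), rng.normal(size=8)
for _ in range(50):
    worker_step(ps, X, y)  # in practice, many workers do this concurrently
```

In a real deployment the pull/push calls are RPCs from worker processes to one or more server processes, which is exactly why the server's job is storing and updating shared parameters rather than managing hardware or preprocessing data.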
💻 Command Output · intermediate
Output of Distributed Training Node Status Command
What is the expected output of the command kubectl get pods -l app=distributed-train if three training pods are running successfully?
kubectl get pods -l app=distributed-train
A
NAME                 READY   STATUS    RESTARTS   AGE
train-worker-0       0/1     Pending   0          10m
train-worker-1       0/1     Pending   0          10m
train-worker-2       0/1     Pending   0          10m
B
NAME                 READY   STATUS    RESTARTS   AGE
train-worker-0       1/1     Running   0          10m
train-worker-1       1/1     Running   0          10m
train-worker-2       1/1     Running   0          10m
C
Error from server (NotFound): pods "train-worker-0" not found
D
No resources found in default namespace.
💡 Hint
Look for pods with status 'Running' and readiness '1/1'.
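In automation you would apply the same readiness check programmatically, e.g. by capturing `kubectl get pods -l app=distributed-train -o json` and inspecting each pod's phase and container readiness. The sketch below hardcodes sample JSON in place of a live cluster, so the pod data here is illustrative only.

```python
# Sketch: checking pod readiness programmatically. In a real pipeline you
# would capture `kubectl get pods -l app=distributed-train -o json`;
# here the JSON is hardcoded sample data for illustration.
import json

sample = json.loads("""
{"items": [
  {"metadata": {"name": "train-worker-0"},
   "status": {"phase": "Running",
              "containerStatuses": [{"ready": true}]}},
  {"metadata": {"name": "train-worker-1"},
   "status": {"phase": "Running",
              "containerStatuses": [{"ready": true}]}},
  {"metadata": {"name": "train-worker-2"},
   "status": {"phase": "Pending",
              "containerStatuses": [{"ready": false}]}}
]}
""")

def ready_pods(pod_list):
    # A pod counts as ready when it is Running and every container
    # in it reports ready -- i.e. READY shows 1/1 and STATUS Running.
    return [
        p["metadata"]["name"]
        for p in pod_list["items"]
        if p["status"]["phase"] == "Running"
        and all(c["ready"] for c in p["status"].get("containerStatuses", []))
    ]

print(ready_pods(sample))  # two of the three sample pods qualify
```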
Configuration · advanced
Correct Distributed Training Configuration Snippet
Which configuration snippet correctly sets up a distributed training job using Horovod with 4 worker replicas in Kubernetes?
A
apiVersion: batch/v1
kind: Job
metadata:
  name: horovod-job
spec:
  parallelism: 4
  template:
    spec:
      containers:
      - name: horovod
        image: horovod/horovod:latest
        command: ["horovodrun", "-np", "4", "python", "train.py"]
      restartPolicy: Never
B
apiVersion: batch/v1
kind: Job
metadata:
  name: horovod-job
spec:
  replicas: 4
  template:
    spec:
      containers:
      - name: horovod
        image: horovod/horovod:latest
        command: ["horovodrun", "-np", "4", "python", "train.py"]
      restartPolicy: Always
C
apiVersion: apps/v1
kind: Deployment
metadata:
  name: horovod-deployment
spec:
  replicas: 4
  template:
    spec:
      containers:
      - name: horovod
        image: horovod/horovod:latest
        command: ["horovodrun", "-np", "1", "python", "train.py"]
      restartPolicy: Always
D
apiVersion: batch/v1
kind: Job
metadata:
  name: horovod-job
spec:
  parallelism: 1
  template:
    spec:
      containers:
      - name: horovod
        image: horovod/horovod:latest
        command: ["horovodrun", "-np", "4", "python", "train.py"]
      restartPolicy: Never
💡 Hint
Check the correct field for parallelism in a Kubernetes Job and the restart policy.
Troubleshoot · advanced
Diagnosing Worker Node Failure in Distributed Training
During distributed training, one worker node repeatedly crashes with an error related to 'connection refused' when trying to reach the parameter server. What is the most likely cause?
A. The parameter server is not running or is unreachable at the expected network address.
B. The worker node has insufficient GPU memory to run the training batch.
C. The worker node's disk is full, preventing checkpoint saving.
D. The training script has a syntax error causing the crash.
💡 Hint
Connection refused usually means network or service availability issues.
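A quick way to internalize the distinction: a refused TCP connection means the host is reachable but nothing is listening on that port (the service is down or bound elsewhere), whereas a timeout points at firewalls or routing. The probe below is a generic sketch; the host/port values are placeholders, not from any specific cluster.

```python
# Sketch: distinguishing 'connection refused' from a timeout when a
# worker cannot reach the parameter server. Refused => nothing is
# listening at host:port; timeout => packets are being dropped.
import socket

def probe(host, port, timeout=2.0):
    """Return 'open', 'refused', or 'timeout' for a TCP endpoint."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    finally:
        s.close()

# A worker seeing 'refused' should check the parameter-server process,
# pod, or Service first -- not its own GPU memory or disk.
print(probe("127.0.0.1", 1))  # port 1 is almost certainly closed locally
```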
🔀 Workflow · expert
Correct Sequence for Distributed Training Job Deployment
Arrange the steps in the correct order to deploy a distributed training job on a Kubernetes cluster using a containerized training image.
A. 3, 1, 2, 4
B. 2, 1, 3, 4
C. 1, 3, 2, 4
D. 1, 2, 3, 4
💡 Hint
Think about the logical order from building to running and monitoring.