Which Kubernetes feature ensures that ML training pods are scheduled on nodes with GPUs?
Think about how Kubernetes selects nodes based on hardware capabilities.
Node affinity lets the scheduler place pods only on nodes that carry specific labels, such as those indicating GPU availability. This is essential for ML workloads that require GPUs.
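As a sketch, a pod can require GPU nodes via node affinity on a label such as `accelerator=nvidia-gpu` (the label key/value and image name here are assumptions; the actual label depends on how your cluster's GPU nodes are labeled):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-train-pod
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: accelerator        # hypothetical label applied to GPU nodes
            operator: In
            values:
            - nvidia-gpu
  containers:
  - name: trainer
    image: registry.example.com/ml-train:v1   # placeholder image
```

For the simple exact-match case, a `nodeSelector: {accelerator: nvidia-gpu}` entry achieves the same effect with less verbosity; node affinity is the more expressive superset.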
What is the output of kubectl describe pod ml-train-pod if the pod is pending due to insufficient GPU resources?
kubectl describe pod ml-train-pod
Look for messages about resource availability in the pod events.
The pod stays Pending because no node has enough allocatable GPUs. The scheduler records a FailedScheduling event on the pod whose message reports the insufficient GPU resource.
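The Events section of the `kubectl describe pod` output typically looks like the following sketch (node counts, age, and exact message wording vary by cluster and Kubernetes version):

```text
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  12s   default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu.
```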
Which YAML snippet correctly defines a PersistentVolumeClaim (PVC) for 50Gi of storage with ReadWriteOnce access mode suitable for ML training data?
Check the correct field for storage size and access mode for single node write access.
The correct PVC specifies the size under spec.resources.requests.storage and sets accessModes to ReadWriteOnce, which allows the volume to be mounted read-write by a single node.
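A minimal PVC matching that description might look like this sketch (the claim name is an assumption, and the storage class is left to the cluster default):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ml-training-data      # hypothetical name
spec:
  accessModes:
  - ReadWriteOnce             # mountable read-write by a single node
  resources:
    requests:
      storage: 50Gi           # requested capacity
  # storageClassName: standard  # optional; cluster-specific
```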
What is the correct order of steps to deploy a distributed ML training job using Kubernetes?
Think about building the image before pushing and defining manifests before applying.
First build the Docker image, then push it to a container registry, next define Kubernetes manifests (e.g., a Job) that reference the pushed image, and finally apply the manifests to run the training job.
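The steps above might look like this on the command line (the image name, registry, and manifest filename are placeholders):

```shell
# 1. Build the training image from the local Dockerfile
docker build -t registry.example.com/ml-train:v1 .

# 2. Push it to the registry so cluster nodes can pull it
docker push registry.example.com/ml-train:v1

# 3. Write manifests (e.g., a Job) referencing that image, then apply them
kubectl apply -f ml-train-job.yaml
```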
An ML training pod repeatedly crashes with CrashLoopBackOff. Logs show Failed to initialize GPU device. What is the most likely cause?
Consider hardware and driver compatibility for GPU access.
GPU initialization failures usually indicate missing or misconfigured GPU drivers on the node, or an absent GPU device plugin, so the container crashes each time it tries to initialize the device, producing the CrashLoopBackOff.
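On NVIDIA nodes, GPU access requires working host drivers plus the NVIDIA device plugin (commonly deployed as a DaemonSet); the pod then requests GPUs through the extended resource, roughly like this sketch (pod name and image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ml-train-pod
spec:
  containers:
  - name: trainer
    image: registry.example.com/ml-train:v1   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 1   # extended resource advertised by the NVIDIA device plugin
```

If the device plugin is not running or the drivers are broken, the `nvidia.com/gpu` resource is never advertised, so either scheduling fails or in-container GPU initialization errors out as in the logs above.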