
Auto-scaling inference endpoints in MLOps - Commands & Configuration

Introduction
When you deploy machine learning models to serve predictions, traffic can vary widely. Auto-scaling inference endpoints automatically adjust the number of servers running your model to absorb more or less traffic without wasting resources or adding latency.
When your app gets more users suddenly and you want predictions to stay fast without manual setup
When traffic to your model varies during the day and you want to save money by not running too many servers
When you want your ML service to be reliable and handle unexpected spikes smoothly
When you deploy models in the cloud and want to use built-in scaling features
When you want to avoid downtime caused by too many requests hitting a single server
Config File - inference_endpoint.yaml
inference_endpoint.yaml
apiVersion: mlops.example.com/v1
kind: InferenceEndpoint
metadata:
  name: my-model-endpoint
spec:
  model:
    name: my-ml-model
    version: v1
  autoscaling:
    minReplicas: 1
    maxReplicas: 5
    targetCPUUtilizationPercentage: 60
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1
      memory: 2Gi

This file defines an inference endpoint for your ML model my-ml-model at version v1. The autoscaling section sets the minimum and maximum number of replicas and the target CPU utilization that triggers scaling. The resources section sets CPU and memory requests and limits for each replica to keep performance stable.
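The autoscaling block behaves like a standard Kubernetes Horizontal Pod Autoscaler. Assuming the InferenceEndpoint controller creates a Deployment of the same name (an assumption, since the custom resource's controller isn't shown here), the equivalent native HPA manifest would look roughly like this:

```yaml
# Hypothetical native equivalent of the autoscaling section above.
# Assumes the InferenceEndpoint controller backs the endpoint with a
# Deployment named my-model-endpoint; adjust scaleTargetRef if it does not.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-model-endpoint
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-model-endpoint
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```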

Commands
This command creates or updates the inference endpoint in your Kubernetes cluster using the configuration file. It sets up the model deployment with auto-scaling rules.
Terminal
kubectl apply -f inference_endpoint.yaml
Expected Output
inferenceendpoint.mlops.example.com/my-model-endpoint created
This command lists the pods running your model endpoint to check how many replicas are active after deployment.
Terminal
kubectl get pods -l app=my-model-endpoint
Expected Output
NAME                                 READY   STATUS    RESTARTS   AGE
my-model-endpoint-5f7d8c9f7f-abcde   1/1     Running   0          30s
-l - Filter pods by label to show only those related to the model endpoint
This command shows the current CPU and memory usage of the pods running your model. It helps you verify whether resource usage is approaching the threshold that triggers auto-scaling.
Terminal
kubectl top pods -l app=my-model-endpoint
Expected Output
NAME                                 CPU(cores)   MEMORY(bytes)
my-model-endpoint-5f7d8c9f7f-abcde   300m         700Mi
-l - Filter pods by label to show only those related to the model endpoint
This command shows the Horizontal Pod Autoscaler status for your model endpoint, including current replicas and CPU usage percentage.
Terminal
kubectl get hpa my-model-endpoint
Expected Output
NAME                REFERENCE                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
my-model-endpoint   Deployment/my-model-endpoint   45%/60%   1         5         2          2m
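The REPLICAS column follows the standard HPA scaling rule: desiredReplicas = ceil(currentReplicas × currentUtilization / targetUtilization). A minimal Python sketch of that arithmetic (illustrative only; the real controller also applies a tolerance band, the maxReplicas cap, and stabilization windows):

```python
import math

def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float) -> int:
    """Core HPA scaling formula (ignoring tolerance, caps, and cooldowns)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)

# With the values from the HPA output above: 2 replicas at 45% CPU
# against a 60% target stays at 2 replicas.
print(desired_replicas(2, 45, 60))  # 2

# A spike to 90% CPU would scale the deployment up to 3 replicas.
print(desired_replicas(2, 90, 60))  # 3
```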
Key Concept

If you remember nothing else from this pattern, remember: auto-scaling adjusts the number of model servers automatically based on resource use to keep predictions fast and cost-effective.

Common Mistakes
Setting minReplicas and maxReplicas to the same number
This disables auto-scaling because the number of replicas cannot change, defeating the purpose.
Set minReplicas lower than maxReplicas to allow scaling up and down.
Not specifying resource requests and limits for pods
Without resource requests, the autoscaler has no baseline against which to compute CPU or memory utilization, so utilization-based scaling won't work.
Always define CPU and memory requests and limits in the pod spec.
Ignoring the targetCPUUtilizationPercentage setting
If this value is too high or too low, scaling may happen too late or too often, causing delays or wasted resources.
Choose a balanced target CPU percentage like 60% to trigger scaling at the right time.
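Beyond the target percentage, the autoscaling/v2 API also lets you tune how aggressively scaling reacts via the behavior field. A sketch, assuming the endpoint is backed by a standard HPA, that slows scale-down to avoid thrashing when inference traffic is bursty:

```yaml
# Illustrative behavior tuning for a standard autoscaling/v2 HPA.
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 min of low load before removing pods
      policies:
        - type: Pods
          value: 1           # remove at most one replica per minute
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0    # scale up immediately on load spikes
```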
Summary
Create an inference endpoint configuration with auto-scaling rules and resource limits.
Apply the configuration to deploy the model and enable auto-scaling.
Check running pods and their resource usage to monitor scaling behavior.
Use the Horizontal Pod Autoscaler status to see current scaling state and adjust settings if needed.