
Auto-scaling inference endpoints in MLOps - Commands & Configuration

Introduction
When you deploy machine learning models to serve predictions, traffic can change a lot. Auto-scaling inference endpoints automatically adjust the number of servers running your model to handle more or less traffic without wasting resources or causing delays.
  • When your app suddenly gets more users and you want predictions to stay fast without manual setup
  • When traffic to your model varies during the day and you want to save money by not running too many servers
  • When you want your ML service to be reliable and handle unexpected spikes smoothly
  • When you deploy models in the cloud and want to use built-in scaling features
  • When you want to avoid downtime caused by too many requests hitting a single server
Config File - inference_endpoint.yaml
apiVersion: mlops.example.com/v1
kind: InferenceEndpoint
metadata:
  name: my-model-endpoint
spec:
  model:
    name: my-ml-model
    version: v1
  autoscaling:
    minReplicas: 1
    maxReplicas: 5
    targetCPUUtilizationPercentage: 60
  resources:
    requests:
      cpu: 500m
      memory: 1Gi
    limits:
      cpu: 1
      memory: 2Gi

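This file defines an inference endpoint for version v1 of the model my-ml-model. The autoscaling section sets the minimum and maximum number of replicas and the average CPU utilization (60%) that scaling aims to maintain. The resources section sets the CPU and memory each replica requests and the limits it may not exceed. InferenceEndpoint under mlops.example.com is a placeholder custom resource, so the cluster needs a matching custom resource definition and controller installed before this file can be applied.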
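The scaling rule behind these settings can be sketched in a few lines of Python. This assumes the endpoint's controller delegates to a standard Kubernetes Horizontal Pod Autoscaler, which computes the desired replica count as ceil(currentReplicas * currentUtilization / targetUtilization) and then clamps it to the configured bounds. The function name and the sample loads are illustrative, not part of any real API.

```python
import math


def desired_replicas(current_replicas, current_cpu_percent,
                     target_cpu_percent=60, min_replicas=1, max_replicas=5):
    """Proportional scaling rule, clamped to [min_replicas, max_replicas].

    Defaults mirror the autoscaling section of inference_endpoint.yaml.
    """
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))


# 2 replicas at 90% average CPU: 2 * 90 / 60 = 3 replicas
print(desired_replicas(2, 90))
# 4 replicas at 120%: the rule asks for 8, but maxReplicas caps it at 5
print(desired_replicas(4, 120))
# 3 replicas at 10%: the rule asks for 1, which is also minReplicas
print(desired_replicas(3, 10))
```

The clamp is why minReplicas and maxReplicas matter: the proportional rule alone would happily ask for 8 replicas or for 0.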

Commands
This command creates or updates the inference endpoint in your Kubernetes cluster using the configuration file. It sets up the model deployment with auto-scaling rules.
Terminal
kubectl apply -f inference_endpoint.yaml
Expected Output
inferenceendpoint.mlops.example.com/my-model-endpoint created
This command lists the pods running your model endpoint to check how many replicas are active after deployment.
Terminal
kubectl get pods -l app=my-model-endpoint
Expected Output
NAME                                 READY   STATUS    RESTARTS   AGE
my-model-endpoint-5f7d8c9f7f-abcde   1/1     Running   0          30s
-l - Filter pods by label to show only those related to the model endpoint
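This command shows the current CPU and memory usage of the pods running your model. For a standard Horizontal Pod Autoscaler, CPU utilization is measured as usage relative to the CPU request, so 300m of usage against the 500m request in the config is 60%. The command needs a metrics source such as metrics-server running in the cluster.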
Terminal
kubectl top pods -l app=my-model-endpoint
Expected Output
NAME                                 CPU(cores)   MEMORY(bytes)
my-model-endpoint-5f7d8c9f7f-abcde   300m         700Mi
-l - Filter pods by label to show only those related to the model endpoint
This command shows the Horizontal Pod Autoscaler status for your model endpoint, including current replicas and CPU usage percentage.
Terminal
kubectl get hpa my-model-endpoint
Expected Output
NAME                REFERENCE                      TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
my-model-endpoint   Deployment/my-model-endpoint   45%/60%   1         5         2          2m
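To read the status row above, it can help to work out what the autoscaler would do next. The sketch below parses that row and applies the documented Horizontal Pod Autoscaler rule, including its default tolerance of 0.1 (no change when the current/target ratio is within 10% of 1). The helper names are hypothetical, and the parser assumes the seven-column layout shown above with numeric TARGETS values.

```python
import math

# Data row from the `kubectl get hpa` output shown above.
ROW = "my-model-endpoint Deployment/my-model-endpoint 45%/60% 1 5 2 2m"


def parse_hpa_row(row):
    """Split one HPA status row into named integer fields."""
    name, reference, targets, min_pods, max_pods, replicas, age = row.split()
    current, target = (int(part.rstrip("%")) for part in targets.split("/"))
    return {
        "current": current, "target": target,
        "min": int(min_pods), "max": int(max_pods), "replicas": int(replicas),
    }


def next_replicas(status, tolerance=0.1):
    """Replica count the autoscaler would choose on its next evaluation."""
    ratio = status["current"] / status["target"]
    if abs(ratio - 1.0) <= tolerance:
        return status["replicas"]  # close enough to target: leave as is
    raw = math.ceil(status["replicas"] * ratio)
    return max(status["min"], min(status["max"], raw))


status = parse_hpa_row(ROW)
print(status)
print(next_replicas(status))  # ceil(2 * 45 / 60) = ceil(1.5) = 2
```

Even though 45% is well under the 60% target, the endpoint stays at 2 replicas: dropping to 1 would push that single replica to roughly 90%.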
Key Concept

If you remember nothing else from this pattern, remember: auto-scaling adjusts the number of model servers automatically based on resource use to keep predictions fast and cost-effective.

Common Mistakes
Setting minReplicas and maxReplicas to the same number
This disables auto-scaling because the number of replicas cannot change, defeating the purpose.
Set minReplicas lower than maxReplicas to allow scaling up and down.
Not specifying resource requests and limits for pods
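CPU utilization is calculated as a percentage of each pod's CPU request. Without a request, the auto-scaler has no baseline to compute utilization against, so scaling on targetCPUUtilizationPercentage won't work.
Always define CPU and memory requests in the pod spec, and add limits so a single replica cannot starve others on the same node.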
Ignoring the targetCPUUtilizationPercentage setting
If this value is too high or too low, scaling may happen too late or too often, causing delays or wasted resources.
Choose a balanced target CPU percentage like 60% to trigger scaling at the right time.
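The three mistakes above can be caught before deployment with a small check. This is a minimal sketch, assuming the spec has already been loaded into a Python dict with the same keys as inference_endpoint.yaml. The function name is hypothetical, and the 30 to 85 range used to flag an unusual CPU target is an arbitrary illustrative choice, not a Kubernetes rule.

```python
# Mirrors the spec section of inference_endpoint.yaml.
GOOD_SPEC = {
    "autoscaling": {"minReplicas": 1, "maxReplicas": 5,
                    "targetCPUUtilizationPercentage": 60},
    "resources": {"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits": {"cpu": "1", "memory": "2Gi"}},
}


def check_endpoint_spec(spec):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    scaling = spec.get("autoscaling", {})
    low, high = scaling.get("minReplicas"), scaling.get("maxReplicas")

    if low is None or high is None:
        problems.append("minReplicas and maxReplicas must both be set")
    elif low > high:
        problems.append("minReplicas is greater than maxReplicas")
    elif low == high:
        problems.append("minReplicas equals maxReplicas, so nothing can scale")

    if "cpu" not in spec.get("resources", {}).get("requests", {}):
        problems.append("no CPU request, so CPU utilization cannot be computed")

    target = scaling.get("targetCPUUtilizationPercentage")
    # Assumed sanity range for illustration only.
    if target is None or not 30 <= target <= 85:
        problems.append("CPU target is missing or outside the assumed 30-85 range")

    return problems


print(check_endpoint_spec(GOOD_SPEC))  # no problems for the config above

bad_spec = {"autoscaling": {"minReplicas": 3, "maxReplicas": 3,
                            "targetCPUUtilizationPercentage": 95}}
for problem in check_endpoint_spec(bad_spec):
    print("-", problem)
```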
Summary
Create an inference endpoint configuration with auto-scaling rules and resource limits.
Apply the configuration to deploy the model and enable auto-scaling.
Check running pods and their resource usage to monitor scaling behavior.
Use the Horizontal Pod Autoscaler status to see current scaling state and adjust settings if needed.

Practice

1. What is the main purpose of auto-scaling inference endpoints in ML services?
easy
A. To automatically adjust the number of servers based on traffic
B. To manually add servers when traffic increases
C. To reduce the accuracy of ML models during high traffic
D. To store more data for training models

Solution

  1. Step 1: Understand auto-scaling concept

    Auto-scaling means the system changes the number of servers automatically depending on the traffic load.
  2. Step 2: Identify the purpose in ML inference

    For ML inference endpoints, auto-scaling keeps the service fast and cost-efficient by adjusting servers without manual work.
  3. Final Answer:

    To automatically adjust the number of servers based on traffic -> Option A
  4. Quick Check:

    Auto-scaling = automatic server adjustment [OK]
Hint: Auto-scaling means automatic server count change [OK]
Common Mistakes:
  • Thinking auto-scaling requires manual server changes
  • Confusing auto-scaling with model accuracy changes
  • Believing auto-scaling stores training data
2. Which configuration setting defines the minimum number of servers to keep running in an auto-scaling inference endpoint?
easy
A. max_servers
B. scale_up_threshold
C. target_utilization
D. min_servers

Solution

  1. Step 1: Identify minimum server setting

    The minimum number of servers to keep running is controlled by min_servers. It plays the same role as minReplicas in the Kubernetes-style configuration shown earlier.
  2. Step 2: Differentiate from other settings

    max_servers sets the upper limit, target_utilization controls load target, and scale_up_threshold is not a standard setting here.
  3. Final Answer:

    min_servers -> Option D
  4. Quick Check:

    Minimum servers = min_servers [OK]
Hint: The minimum-server setting is the one whose name starts with 'min_' [OK]
Common Mistakes:
  • Confusing max_servers with minimum servers
  • Mixing target utilization with server count
  • Using non-existent settings like scale_up_threshold
3. Given this auto-scaling config snippet:
{
  "min_servers": 2,
  "max_servers": 5,
  "target_utilization": 0.7
}

If the current server usage is 80%, what will likely happen?
medium
A. The system will scale up servers to reduce load
B. The system will scale down servers to save cost
C. The system will keep the same number of servers
D. The system will shut down all servers

Solution

  1. Step 1: Compare current usage to target utilization

    The current usage (80%) is higher than the target utilization (70%).
  2. Step 2: Determine scaling action

    Since usage is above target, the system will add servers (scale up) to reduce load and meet the target.
  3. Final Answer:

    The system will scale up servers to reduce load -> Option A
  4. Quick Check:

    Usage > target = scale up [OK]
Hint: If usage > target, scale up servers [OK]
Common Mistakes:
  • Scaling down when usage is above target
  • Assuming no change if usage is slightly above target
  • Thinking system shuts down servers automatically
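The reasoning in question 3 can be written as a small decision function. This is a simplified sketch: the function name and the current server counts are hypothetical, and real autoscalers usually add tolerance bands and cooldown periods on top of the plain comparison shown here.

```python
import json

# Config snippet from question 3.
CONFIG = json.loads(
    '{"min_servers": 2, "max_servers": 5, "target_utilization": 0.7}'
)


def scaling_action(config, current_servers, usage):
    """Compare usage with the target, respecting the server bounds."""
    target = config["target_utilization"]
    if usage > target and current_servers < config["max_servers"]:
        return "scale up"
    if usage < target and current_servers > config["min_servers"]:
        return "scale down"
    return "hold"


print(scaling_action(CONFIG, 2, 0.8))  # above target, room to grow
print(scaling_action(CONFIG, 5, 0.8))  # above target, but already at max
print(scaling_action(CONFIG, 3, 0.4))  # below target, above min
```

The second call shows the one case where usage above target does not add servers: the endpoint is already at max_servers.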
4. You configured an auto-scaling endpoint with min_servers: 1 and max_servers: 3. The system never scales above 1 server even under high load. What is the most likely cause?
medium
A. The max_servers is set too low to allow scaling
B. The target utilization is set too high, preventing scale up
C. The min_servers value is incorrectly set to 3
D. The system does not support auto-scaling

Solution

  1. Step 1: Analyze scaling limits

    Min servers is 1 and max servers is 3, so scaling up to 3 is allowed.
  2. Step 2: Check target utilization impact

    If target utilization is set very high (e.g., 90%+), measured load rarely exceeds it, so the auto-scaler treats the current load as acceptable and never adds a server.
  3. Final Answer:

    The target utilization is set too high, preventing scale up -> Option B
  4. Quick Check:

    High target utilization blocks scaling up [OK]
Hint: High target utilization can block scaling up [OK]
Common Mistakes:
  • Confusing max_servers as too low when it allows scaling
  • Misreading min_servers as max_servers
  • Assuming system lacks auto-scaling support
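To see why a high target blocks scale-up in question 4, the sketch below holds the load fixed and varies only the target. It assumes a proportional rule of the form ceil(servers * usage / target), as used by the Kubernetes Horizontal Pod Autoscaler; the 85% load figure and the function name are made up for illustration.

```python
import math


def servers_wanted(current_servers, usage, target,
                   min_servers=1, max_servers=3):
    """Proportional rule clamped to the bounds from question 4."""
    raw = math.ceil(current_servers * usage / target)
    return max(min_servers, min(max_servers, raw))


# One server running at 85% utilization, evaluated against three targets.
for target in (0.6, 0.7, 0.95):
    print(target, servers_wanted(1, 0.85, target))
```

With a target of 0.6 or 0.7 the rule asks for 2 servers, but with 0.95 it stays at 1, even though the server is busy 85% of the time.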
5. You want to configure an auto-scaling inference endpoint that never drops below 2 servers, never exceeds 6 servers, and aims to keep CPU usage around 60%. Which configuration is correct?
hard
A. { "min_servers": 2, "max_servers": 6, "target_utilization": 0.9 }
B. { "min_servers": 6, "max_servers": 2, "target_utilization": 0.6 }
C. { "min_servers": 2, "max_servers": 6, "target_utilization": 0.6 }
D. { "min_servers": 1, "max_servers": 6, "target_utilization": 0.6 }

Solution

  1. Step 1: Set minimum and maximum servers correctly

    Minimum servers should be 2 and maximum servers 6, so min_servers: 2 and max_servers: 6 are correct.
  2. Step 2: Set target utilization to 60%

    Target utilization should be 0.6 (60%) to keep CPU usage around that level.
  3. Step 3: Verify options

    Option C, { "min_servers": 2, "max_servers": 6, "target_utilization": 0.6 }, matches all requirements. Option B reverses the minimum and maximum. Option A sets target_utilization to 0.9 instead of 0.6. Option D sets min_servers to 1, which is below the required floor of 2.
  4. Final Answer:

    { "min_servers": 2, "max_servers": 6, "target_utilization": 0.6 } -> Option C
  5. Quick Check:

    Correct min, max, and target utilization = { "min_servers": 2, "max_servers": 6, "target_utilization": 0.6 } [OK]
Hint: Min ≤ max and target_utilization as decimal (0.6) [OK]
Common Mistakes:
  • Swapping min_servers and max_servers values
  • Using target_utilization as percentage (60) instead of decimal (0.6)
  • Setting min_servers lower than required
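The elimination in question 5 can also be done mechanically. The sketch below encodes the four options and filters them against the stated requirements; the helper name and its parameters are hypothetical.

```python
# The four answer options from question 5.
OPTIONS = {
    "A": {"min_servers": 2, "max_servers": 6, "target_utilization": 0.9},
    "B": {"min_servers": 6, "max_servers": 2, "target_utilization": 0.6},
    "C": {"min_servers": 2, "max_servers": 6, "target_utilization": 0.6},
    "D": {"min_servers": 1, "max_servers": 6, "target_utilization": 0.6},
}


def meets_requirements(config, floor=2, ceiling=6, target=0.6):
    """True when the config has the required floor, ceiling, and CPU target."""
    return (
        config["min_servers"] == floor
        and config["max_servers"] == ceiling
        and config["min_servers"] <= config["max_servers"]
        and abs(config["target_utilization"] - target) < 1e-9
    )


matching = [label for label, config in OPTIONS.items()
            if meets_requirements(config)]
print(matching)  # ['C']
```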