Which metric is most commonly used to trigger auto-scaling of inference endpoints in a cloud environment?
Think about what resource usage directly affects the ability to handle inference requests.
CPU utilization is a direct indicator of how busy the inference endpoint is. When CPU usage is high, scaling out (adding instances) lets the endpoint handle more concurrent requests.
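The decision described above can be sketched as a simple threshold check. This is a minimal illustration, not a real autoscaler's API; the thresholds and the `desired_instances` helper are assumptions.

```python
SCALE_OUT_CPU = 70.0  # scale out above this utilization (%)
SCALE_IN_CPU = 30.0   # scale in below this utilization (%)

def desired_instances(current: int, cpu_percent: float,
                      min_instances: int = 1, max_instances: int = 10) -> int:
    """Return the instance count after one scaling decision."""
    if cpu_percent > SCALE_OUT_CPU:
        return min(current + 1, max_instances)  # busy: add an instance
    if cpu_percent < SCALE_IN_CPU:
        return max(current - 1, min_instances)  # idle: remove an instance
    return current  # within the stable band: no change
```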
Given the following CLI output from an auto-scaling tool monitoring an inference endpoint, what is the current number of active instances?
Endpoint: model-v1
Instances: 3
CPU Utilization: 75%
Scaling Status: Stable
Look for the line that indicates how many instances are running.
The line 'Instances: 3' shows that there are currently 3 active instances serving the endpoint.
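Reading this kind of status output programmatically is a common follow-up step. The sketch below parses the `Key: Value` lines from the sample; the field names mirror the output shown, but the monitoring tool itself is unspecified.

```python
def parse_status(output: str) -> dict:
    """Split 'Key: Value' lines into a dict of string fields."""
    fields = {}
    for line in output.strip().splitlines():
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields

sample = """\
Endpoint: model-v1
Instances: 3
CPU Utilization: 75%
Scaling Status: Stable
"""
status = parse_status(sample)
active = int(status["Instances"])  # the current number of active instances
```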
Which YAML snippet correctly configures an auto-scaling policy to scale out when CPU usage exceeds 70% and scale in when below 30%?
Look for explicit scale out and scale in thresholds.
Option C explicitly sets both scaleOutCPUThreshold (70%) and scaleInCPUThreshold (30%), which are required to define when the endpoint scales out and when it scales in.
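A policy with both thresholds can be modeled and sanity-checked as below. The key names scaleOutCPUThreshold and scaleInCPUThreshold come from the question; representing the config as a Python mapping and the validation rule are assumptions for illustration.

```python
# Policy equivalent to the YAML in Option C, as a Python mapping.
policy = {
    "scaleOutCPUThreshold": 70,  # scale out when CPU exceeds 70%
    "scaleInCPUThreshold": 30,   # scale in when CPU drops below 30%
}

def validate(policy: dict) -> bool:
    """Both thresholds must be present, with the scale-in threshold
    strictly below the scale-out threshold (a stable band between them)."""
    out = policy.get("scaleOutCPUThreshold")
    inn = policy.get("scaleInCPUThreshold")
    return out is not None and inn is not None and inn < out
```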
An inference endpoint is not scaling out despite high CPU usage. Which of the following is the most likely cause?
Check if the scaling limits allow more instances to be created.
If maxInstances is set to 1, the autoscaler cannot add instances no matter how high CPU usage climbs, so scale-out never occurs.
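The failure mode can be shown with a small check: high CPU alone is not enough to scale out if the instance cap is already reached. The function name and parameters are illustrative, not a specific autoscaler's API.

```python
def can_scale_out(current: int, cpu_percent: float,
                  cpu_threshold: float, max_instances: int) -> bool:
    """Scale out only when CPU is over threshold AND the cap allows growth."""
    return cpu_percent > cpu_threshold and current < max_instances
```

With maxInstances set to 1, `can_scale_out(1, 95.0, 70.0, 1)` is False even at 95% CPU, which is exactly the stuck behavior described.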
Arrange the steps in the correct order for an auto-scaling workflow of an inference endpoint.
Think about monitoring first, then triggering, then adding instances, then routing traffic.
The workflow starts with monitoring metrics, then triggers scaling, adds instances, and finally distributes requests.
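The four stages above can be sketched as an ordered pipeline. The stage names paraphrase the answer; the function is a placeholder, not a real orchestration API.

```python
def autoscaling_workflow() -> list:
    """Return the auto-scaling stages in the order they run."""
    return [
        "monitor metrics",       # 1. collect CPU utilization from the endpoint
        "trigger scaling",       # 2. compare metrics against the thresholds
        "add instances",         # 3. provision new replicas on scale-out
        "distribute requests",   # 4. route traffic across all instances
    ]
```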