Recall & Review
beginner
What is an auto-scaling inference endpoint?
An auto-scaling inference endpoint automatically adjusts the number of servers handling machine learning model predictions based on the current demand. This helps keep response times fast and costs low.
beginner
Why is auto-scaling important for inference endpoints?
Auto-scaling ensures that the system can handle sudden increases or decreases in prediction requests without delays or wasted resources, similar to how a store opens more checkout counters when many customers arrive.
intermediate
Name two common metrics used to trigger auto-scaling for inference endpoints.
Two common metrics are CPU usage and request latency. When CPU usage stays high or latency rises above its target, the system adds servers to handle the load; when the metrics fall back below target, it removes servers again.
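The metric-driven decision above can be sketched as a proportional rule: scale the current server count by the ratio of the observed metric to its target. This is a minimal illustration, not any particular platform's implementation, although the Kubernetes Horizontal Pod Autoscaler documents a formula of the same shape. The function name `desired_replicas`, the 60% CPU target, and the min/max bounds are assumptions chosen for the example.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Return a server count proportional to observed load (hypothetical helper).

    current_metric and target_metric use the same unit, for example
    average CPU utilization in percent or latency in milliseconds.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # Proportional rule: more load than the target means more servers.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp so the fleet never drops to zero or grows without bound.
    return max(min_replicas, min(max_replicas, raw))


# 4 servers at 90% CPU against a 60% target: scale out.
print(desired_replicas(4, 90.0, 60.0))  # 6
# 4 servers at 30% CPU against a 60% target: scale in.
print(desired_replicas(4, 30.0, 60.0))  # 2
```

The same function works for latency by passing the observed and target latency instead of CPU figures. Real autoscalers typically also add cooldown periods so the fleet does not oscillate on short spikes.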
intermediate
What is the difference between horizontal and vertical scaling in the context of inference endpoints?
Horizontal scaling adds or removes servers (machines) to handle load, while vertical scaling changes the resources (CPU, memory) of a single server. Auto-scaling usually refers to horizontal scaling.
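To make the distinction concrete, here is a small sketch that models a fleet as a server count plus a per-server size. The `Fleet` class, its field names, and the vCPU figures are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Fleet:
    """Hypothetical model of an endpoint's capacity."""
    replicas: int            # number of servers
    vcpus_per_replica: int   # size of each server

    @property
    def total_vcpus(self):
        return self.replicas * self.vcpus_per_replica


def scale_horizontally(fleet, new_replicas):
    # Horizontal: change how many servers there are; each stays the same size.
    return replace(fleet, replicas=new_replicas)


def scale_vertically(fleet, new_vcpus):
    # Vertical: keep the same servers but give each one more (or fewer) resources.
    return replace(fleet, vcpus_per_replica=new_vcpus)


base = Fleet(replicas=2, vcpus_per_replica=4)
wider = scale_horizontally(base, 4)   # 4 servers x 4 vCPUs
taller = scale_vertically(base, 8)    # 2 servers x 8 vCPUs
print(wider.total_vcpus, taller.total_vcpus)  # 16 16
```

Both paths reach the same total capacity here, but only the horizontal one changes the server count.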
beginner
How does auto-scaling help reduce costs in machine learning inference?
By only running the number of servers needed for current demand, auto-scaling avoids paying for idle resources, similar to turning off lights in empty rooms to save electricity.
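A quick worked example of the saving, using made-up numbers: the hourly demand, the per-server capacity, and the price below are all assumptions, not figures from any provider. The sketch compares a fixed fleet sized for the peak hour with a fleet that is resized every hour.

```python
import math


def servers_needed(requests_per_hour, capacity_per_server):
    """Servers required for one hour of demand, keeping at least one running."""
    return max(1, math.ceil(requests_per_hour / capacity_per_server))


# Hypothetical workload: quiet, busy, peak, busy, quiet.
hourly_demand = [200, 200, 900, 1500, 900, 200]
CAPACITY_PER_SERVER = 500     # requests one server handles per hour (assumed)
PRICE_PER_SERVER_HOUR = 2.0   # cost in arbitrary currency units (assumed)

per_hour = [servers_needed(d, CAPACITY_PER_SERVER) for d in hourly_demand]

# Auto-scaled: pay only for the servers each hour actually needs.
autoscaled_cost = sum(per_hour) * PRICE_PER_SERVER_HOUR
# Fixed fleet: pay for peak capacity during every hour.
fixed_cost = max(per_hour) * len(hourly_demand) * PRICE_PER_SERVER_HOUR

print(per_hour)         # [1, 1, 2, 3, 2, 1]
print(autoscaled_cost)  # 20.0
print(fixed_cost)       # 36.0
```

In this toy scenario the auto-scaled fleet costs 20 units against 36 for the fixed fleet. The actual saving depends entirely on how uneven the demand is.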
What do auto-scaling inference endpoints adjust automatically?
Auto-scaling changes the number of servers to match the demand for predictions.
Which metric is commonly used to trigger auto-scaling?
CPU usage indicates how busy the servers are and helps decide when to add or remove servers.
What is horizontal scaling in inference endpoints?
Horizontal scaling means adding or removing servers to handle load.
How does auto-scaling help with cost savings?
Auto-scaling avoids paying for unused servers by adjusting capacity to demand.
When might an auto-scaling system reduce the number of servers?
Auto-scaling reduces the number of servers when fewer prediction requests come in.
Explain how auto-scaling inference endpoints work and why they are useful.
Think about how a busy store opens more checkout lines when many customers arrive.
Describe the difference between horizontal and vertical scaling in the context of inference endpoints.
Horizontal means more machines; vertical means bigger machines.