Imagine a machine learning model served from a single central server to users worldwide. What is the main effect of this setup on latency?
Think about how distance affects travel time for data packets.
A centralized server means users far away have longer network paths, increasing latency.
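To make the distance effect concrete, here is a minimal sketch that computes the physical lower bound on round-trip time. The fiber speed constant and the distances are illustrative assumptions; real latency also includes routing hops, queuing, and processing time on top of this floor.

```python
# Lower-bound network latency from distance alone.
# Light in optical fiber travels at roughly 2/3 the speed of light
# (~200,000 km/s), so distance sets a hard floor on round-trip time.

FIBER_SPEED_KM_PER_MS = 200.0  # ~200,000 km/s expressed in km per millisecond

def min_round_trip_ms(distance_km: float) -> float:
    """Best-case round-trip time for one request/response over fiber."""
    return 2 * distance_km / FIBER_SPEED_KM_PER_MS

# A user 100 km from the server vs. one 10,000 km away:
print(min_round_trip_ms(100))     # 1.0 ms floor
print(min_round_trip_ms(10_000))  # 100.0 ms floor
```

Even in the best case, a user on another continent pays a latency floor two orders of magnitude higher than a nearby user, which no server-side optimization can remove.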
What is the main reason a distributed serving architecture can reduce latency for users?
Consider how physical proximity affects data travel time.
Distributed serving places servers near users, so data travels shorter distances, lowering latency.
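The routing decision behind this can be sketched as picking the region closest to the user. The region names and distances below are illustrative assumptions, not a real deployment.

```python
# Route each user to the nearest serving region (illustrative sketch).

def nearest_region(user_distances_km: dict[str, float]) -> str:
    """Return the region with the shortest distance to the user."""
    return min(user_distances_km, key=user_distances_km.get)

# A user in Europe, with hypothetical distances to three regions:
distances = {"us-east": 6000.0, "eu-west": 300.0, "ap-south": 7500.0}
print(nearest_region(distances))  # eu-west
```

Real systems typically make this decision with anycast routing or GeoDNS rather than explicit distance tables, but the principle is the same: shorter paths, lower latency.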
Given this autoscaling configuration snippet for a model serving deployment, what is the expected effect on cost when traffic spikes?
autoscaling:
  min_replicas: 1
  max_replicas: 10
  target_cpu_utilization_percentage: 50
Autoscaling adds more replicas when CPU usage is high.
Autoscaling increases replicas during high traffic, raising cost due to more resources used.
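The scaling behavior can be sketched with the rule used by autoscalers such as the Kubernetes HorizontalPodAutoscaler: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the configured bounds. The traffic numbers below are illustrative.

```python
import math

# Sketch of a utilization-based scaling rule, using the bounds from the
# snippet above (min_replicas=1, max_replicas=10, target 50% CPU).

def desired_replicas(current: int, cpu_util_pct: float,
                     target_pct: float = 50.0,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replica count that brings CPU back toward the target, within bounds."""
    raw = math.ceil(current * cpu_util_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))

# Traffic spike pushes CPU to 90% across 4 replicas:
print(desired_replicas(4, 90.0))   # 8 -> scales up, cost rises
# Quiet period at 10% CPU on 4 replicas:
print(desired_replicas(4, 10.0))   # 1 -> scales down, cost falls
```

The cost effect is symmetric: spikes add replicas (and spend), while quiet periods shrink the deployment back toward `min_replicas`.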
A distributed serving system still shows high latency for some users. Which issue is the most likely cause?
Think about factors outside the server that affect latency.
Even with servers placed near users, congestion along the network path — an overloaded last-mile link or a saturated transit route — can still add queuing delay, so some users see high latency despite the distributed architecture.
Which serving architecture best balances low latency and controlled cost for a global user base with variable traffic?
Consider how autoscaling helps manage cost during low traffic.
Distributed servers reduce latency by proximity; autoscaling controls cost by adjusting replicas to traffic.
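The cost side of this trade-off can be sketched by comparing autoscaled spend against provisioning for peak all day. The traffic pattern, per-replica capacity, and price are all illustrative assumptions.

```python
import math

PRICE_PER_REPLICA_HOUR = 0.10       # hypothetical unit cost
CAPACITY_PER_REPLICA = 100.0        # hypothetical requests/s one replica handles

def replicas_for_load(requests_per_s: float,
                      min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Replicas needed to serve the load, within the autoscaler's bounds."""
    raw = math.ceil(requests_per_s / CAPACITY_PER_REPLICA)
    return max(min_replicas, min(max_replicas, raw))

# Hourly traffic for one region across a day with a midday spike (requests/s):
traffic = [50, 50, 400, 900, 400, 50]

autoscaled = sum(replicas_for_load(t) for t in traffic) * PRICE_PER_REPLICA_HOUR
fixed_peak = len(traffic) * replicas_for_load(max(traffic)) * PRICE_PER_REPLICA_HOUR
print(autoscaled, fixed_peak)  # autoscaling is cheaper off-peak
```

Replicating this per region gives the full picture: each region serves nearby users (low latency) while scaling its own replica count to local traffic (controlled cost).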