Which factor most directly affects the latency of a machine learning model during inference?
Think about what happens when the model makes a prediction.
Inference latency depends mainly on model complexity, which is determined by the number of layers and parameters: more parameters mean more computation per prediction, so larger models take longer to produce outputs.
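A minimal sketch of this effect, using stacked matrix multiplications as a stand-in for a network's dense layers (layer counts and widths are illustrative, not from the original):

```python
import time
import numpy as np

def dense_forward(x, weights):
    # Apply each layer's weight matrix in sequence (a simplified
    # stand-in for a neural network's dense layers; activations omitted).
    for w in weights:
        x = x @ w
    return x

def time_model(n_layers, width, trials=5):
    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((width, width)) for _ in range(n_layers)]
    x = rng.standard_normal((1, width))
    start = time.perf_counter()
    for _ in range(trials):
        dense_forward(x, weights)
    return (time.perf_counter() - start) / trials

small = time_model(n_layers=2, width=128)
large = time_model(n_layers=16, width=1024)
print(f"small model: {small:.6f}s, large model: {large:.6f}s")
```

The larger model performs far more multiply-accumulate operations per prediction, so its measured latency is higher.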
What is the output of the following code that simulates latency for different batch sizes?
import time

def simulate_latency(batch_size):
    base_time = 0.01  # seconds per sample
    total_time = base_time * batch_size
    time.sleep(total_time)
    return total_time

latencies = {b: simulate_latency(b) for b in [1, 5, 10]}
print(latencies)
The output is {1: 0.01, 5: 0.05, 10: 0.1}: latency scales linearly with batch size in this simulation.
The function multiplies the 0.01 s base time per sample by the batch size, so latency increases linearly, and the dictionary comprehension records the returned time for each batch size.
You need to deploy a model on a device with limited processing power and require very low latency. Which model architecture is best suited?
Think about model size and computation needed for fast predictions.
A decision tree: evaluating one requires only a handful of comparisons per prediction, with no heavy matrix arithmetic, making it well suited to low-latency inference on limited hardware.
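To see why tree evaluation is cheap, here is a hand-coded sketch of a trained tree (the feature names and thresholds are hypothetical): a prediction is just a few nested comparisons.

```python
def tiny_tree_predict(features):
    # A hand-coded decision tree: one prediction costs at most a few
    # comparisons, regardless of hardware, with no matrix math at all.
    if features["cpu_load"] < 0.5:
        if features["temp"] < 70:
            return "ok"
        return "warm"
    return "busy"

print(tiny_tree_predict({"cpu_load": 0.3, "temp": 65}))  # → ok
```

In practice a library tree (e.g. one trained with scikit-learn) compiles to a similar chain of threshold checks, which is why inference stays fast on constrained devices.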
Which hyperparameter adjustment is most likely to reduce inference latency without retraining the model?
Consider what happens when you process fewer samples at once.
Lowering batch size reduces the amount of data processed at once, which can reduce latency per batch during inference.
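The effect can be sketched with the same linear-cost simulation used earlier (the base time per sample is an assumption for illustration):

```python
import time

def simulate_batch_latency(batch_size, base_time=0.01):
    # Same linear model as the earlier simulation: cost grows with
    # the number of samples processed together.
    time.sleep(base_time * batch_size)
    return base_time * batch_size

# Per-batch latency drops when fewer samples are processed at once,
# even though the cost per sample is unchanged.
print("batch of 32:", simulate_batch_latency(32), "s")
print("batch of 1: ", simulate_batch_latency(1), "s")
```

Note the trade-off: smaller batches lower the latency of each individual response, but they can reduce overall throughput when many samples must be served.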
Given the code below, which line is the main cause of increased latency during inference?
def predict(model, data):
    results = []
    for sample in data:
        processed = preprocess(sample)
        output = model(processed)
        results.append(output)
    return results

# preprocess is slow due to heavy image resizing
# model is optimized and fast
Focus on which step is described as slow.
The call to preprocess(sample) is the bottleneck: the comments state that preprocessing is slow due to heavy image resizing, while the model itself is optimized and fast.
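A bottleneck like this can be confirmed by timing each step separately. The sketch below uses hypothetical stand-ins for the slow preprocessing and the fast model, since the originals are not shown:

```python
import time

def slow_preprocess(sample):
    # Stand-in for heavy image resizing (hypothetical helper).
    time.sleep(0.02)
    return sample

def fast_model(x):
    # Stand-in for an optimized, fast model.
    return x * 2

def predict_timed(data):
    prep_total = model_total = 0.0
    results = []
    for sample in data:
        t0 = time.perf_counter()
        processed = slow_preprocess(sample)
        t1 = time.perf_counter()
        output = fast_model(processed)
        t2 = time.perf_counter()
        prep_total += t1 - t0
        model_total += t2 - t1
        results.append(output)
    return results, prep_total, model_total

results, prep_total, model_total = predict_timed([1, 2, 3])
print(f"preprocess: {prep_total:.3f}s, model: {model_total:.3f}s")
```

Once timing confirms where the latency comes from, remedies such as cheaper resizing, caching preprocessed inputs, or moving preprocessing off the request path can be targeted at the right step.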