Latency is the time a model takes to return an answer after it receives a request. The key metric is response time, usually measured in milliseconds (ms). Lower latency means faster answers, which matters for real-time apps like chatbots or voice assistants. Throughput (how many requests per second a system can handle) also matters when many users send requests at once. But the main focus here is on making each individual answer arrive quickly.
Latency optimization in Prompt Engineering / GenAI - Model Metrics & Evaluation
Latency optimization does not use a confusion matrix because it is not about right or wrong answers. Instead, we look at timing data like this:
Request # | Start Time (ms) | End Time (ms) | Latency (ms)
--------- | -------------- | ------------ | ------------
1 | 1000 | 1020 | 20
2 | 1025 | 1045 | 20
3 | 1050 | 1080 | 30
4 | 1085 | 1100 | 15
Average Latency = (20 + 20 + 30 + 15) / 4 = 21.25 ms
This table shows how long each request took. We want to reduce the average latency number.
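The arithmetic in the table can be reproduced in a few lines of Python. This is only a minimal sketch: the `requests` list hard-codes the start and end times from the table, and the variable names are illustrative rather than part of any library.

```python
# (start_ms, end_ms) pairs copied from the table above.
requests = [(1000, 1020), (1025, 1045), (1050, 1080), (1085, 1100)]

# Latency of one request is its end time minus its start time.
latencies_ms = [end - start for start, end in requests]

# Average latency is the sum of all latencies divided by the request count.
average_ms = sum(latencies_ms) / len(latencies_ms)

print(latencies_ms)  # [20, 20, 30, 15]
print(average_ms)    # 21.25
```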
When optimizing latency, there is often a tradeoff between speed and accuracy. Making a model faster might mean it uses simpler calculations or fewer steps, which can reduce accuracy. For example:
- A chatbot that answers quickly but sometimes gives less detailed answers.
- A voice assistant that responds fast but may misunderstand complex questions.
Choosing the right balance depends on the app's needs. For urgent tasks, speed is more important. For detailed tasks, accuracy matters more.
Good latency: Under 100 ms for interactive apps feels instant to users. For example, a chatbot responding in 50 ms is excellent.
Bad latency: Over 500 ms can feel slow and frustrating. If a voice assistant takes 1 second or more, users may lose patience.
Remember, what is "good" depends on the app. A batch job running overnight can have high latency without problems.
- Ignoring variability: Average latency can hide spikes. Always check max and percentiles (like 95th percentile) to see worst delays.
- Over-optimizing for speed: Making a model too simple to be fast can hurt accuracy badly.
- Data leakage: Using information that is not available at request time (such as future data) to speed up predictions is cheating and breaks real-world use.
- Not testing in real conditions: Latency in a lab may be low but real users face network delays and slow devices.
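The first pitfall, an average hiding spikes, is easy to see with numbers. The sketch below uses made-up latencies (an assumption, not measured data) and a simple nearest-rank `percentile` helper written for this example.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies in ms: 18 fast requests and two slow outliers.
latencies_ms = [40] * 18 + [600, 900]

print(sum(latencies_ms) / len(latencies_ms))  # 111.0
print(percentile(latencies_ms, 50))           # 40
print(percentile(latencies_ms, 95))           # 600
print(max(latencies_ms))                      # 900
```

The mean (111 ms) describes no real request here: most take 40 ms and a few take far longer, which only the 95th percentile and the maximum reveal.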
Your chatbot model has an average latency of 80 ms but sometimes spikes to 600 ms on some requests. Is this good for a live chat app? Why or why not?
Answer: The average latency of 80 ms is good and feels fast. But spikes to 600 ms can make some answers feel slow and frustrate users. For live chat, consistent speed is important, so you should work to reduce those spikes for a better experience.
Practice
Solution
Step 1: Understand latency meaning
Latency means the time it takes for a model to give an answer after input.
Step 2: Connect latency to user experience
Lower latency means faster responses, which improves how users feel about the AI.
Final Answer:
To make AI models respond faster for better user experience -> Option A
Quick Check:
Latency optimization = faster response [OK]
- Confusing latency with model size
- Thinking latency means accuracy
- Assuming more layers reduce latency
Solution
Step 1: Identify correct time functions
Python's time.time() returns current time in seconds; subtracting gives elapsed time.
Step 2: Check latency calculation
Latency = end - start measures duration correctly; other options misuse functions or operations.
Final Answer:
start = time.time(); model.predict(x); end = time.time(); latency = end - start -> Option B
Quick Check:
Latency = end - start time [OK]
- Using time.now() which does not exist
- Subtracting start - end instead of end - start
- Using time.sleep() which pauses code, not measures time
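The same start/end pattern can be wrapped in a small helper. The sketch below is one possible way to do it, not the lesson's own code: `fake_predict` is a hypothetical stand-in for `model.predict`, and `timed_call` is a name invented for this example. It uses `time.perf_counter()`, a monotonic high-resolution clock in the standard library that suits short durations; `time.time()` also works but can jump if the system clock is adjusted.

```python
import time

def fake_predict(x):
    """Hypothetical stand-in for model.predict; sleeps briefly to simulate work."""
    time.sleep(0.01)
    return x * 2

def timed_call(fn, *args):
    """Call fn(*args) once and return (result, latency in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    end = time.perf_counter()
    return result, (end - start) * 1000  # seconds -> milliseconds

result, latency_ms = timed_call(fake_predict, 21)
print(result, f"{latency_ms:.1f} ms")
```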
import time
start = time.time()
for _ in range(1000000):
    pass
end = time.time()
print(round(end - start, 2))
Solution
Step 1: Understand the loop workload
The loop runs 1,000,000 times doing nothing (pass), so each iteration costs only Python's loop overhead and the whole loop finishes quickly.
Step 2: Estimate time taken
On a typical computer, this empty loop takes roughly 0.03-0.1 seconds, so round(end - start, 2) prints a small value such as 0.05: close to 0.00 and nowhere near whole seconds.
Final Answer:
A number close to 0.00 -> Option D
Quick Check:
1M empty loops ~0.05s [OK]
- Overestimating time for empty loop (e.g., thinking 1 second)
- Thinking it takes 10 seconds
- Assuming syntax error due to indentation
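A single measurement like the one in the snippet is noisy, because other processes on the machine can slow any one run. As a rough sketch, the standard-library `timeit` module can repeat the measurement several times; the `empty_loop` function name is chosen for this example only, and the actual timings depend on the machine.

```python
import timeit

def empty_loop():
    """The same 1,000,000-iteration empty loop as in the question."""
    for _ in range(1_000_000):
        pass

# Take 5 separate measurements, each running the loop once.
runs = timeit.repeat(empty_loop, repeat=5, number=1)

# The fastest run is usually the one least disturbed by other activity.
print(f"fastest: {min(runs):.3f} s, slowest: {max(runs):.3f} s")
```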
Solution
Step 1: Understand pruning effect
Pruning removes less important parts to speed up the model, so latency should decrease if it is done right.
Step 2: Identify why latency increased
If latency increased, pruning likely added overhead or was done incorrectly, causing slower execution.
Final Answer:
Pruning was done incorrectly causing overhead in model execution -> Option C
Quick Check:
Incorrect pruning = more overhead = higher latency [OK]
- Assuming pruning always slows model
- Ignoring measurement timing
- Thinking pruning removes important layers by default
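One plausible way pruning can fail to help, offered here as an illustration rather than as the cause in the question, is when weights are merely set to zero while the model still runs the same dense computation. The sketch below uses plain Python lists and two hypothetical helpers, `magnitude_prune` and `dense_matvec`, to count multiplications before and after pruning.

```python
def magnitude_prune(weights, fraction):
    """Zero out roughly `fraction` of the weights, smallest absolute values first.

    Weights tied with the threshold value are zeroed as well.
    """
    flat = sorted(abs(w) for row in weights for w in row)
    k = int(len(flat) * fraction)  # number of weights to zero
    if k == 0:
        return [row[:] for row in weights]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in weights]

def dense_matvec(weights, x):
    """Dense matrix-vector product that also counts multiplications performed."""
    multiplications = 0
    out = []
    for row in weights:
        total = 0.0
        for w, xi in zip(row, x):
            total += w * xi
            multiplications += 1
        out.append(total)
    return out, multiplications

weights = [[0.9, -0.05, 0.4], [0.01, -0.7, 0.02]]
x = [1.0, 2.0, 3.0]

pruned = magnitude_prune(weights, 0.5)
_, before = dense_matvec(weights, x)
_, after = dense_matvec(pruned, x)

print(pruned)         # [[0.9, 0.0, 0.4], [0.0, -0.7, 0.0]]
print(before, after)  # 6 6
```

Half the weights are now zero, yet the dense routine still performs all six multiplications. A speedup only appears when the zeros are actually skipped, for example by removing whole rows or by using code that exploits sparsity.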
Solution
Step 1: Identify techniques for latency reduction on mobile
Quantization lowers numeric precision (for example, from 32-bit floats to 8-bit integers) to speed up computation; pruning removes unneeded weights to shrink the model.
Step 2: Evaluate options
Adding layers or using bigger models increases latency; caching helps, but alone it is not enough for mobile constraints.
Final Answer:
Use quantization to reduce precision and prune unimportant weights -> Option A
Quick Check:
Quantization + pruning = best latency reduction [OK]
- Thinking bigger models reduce latency
- Relying only on caching for mobile speed
- Ignoring model size impact on mobile devices
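To make "reducing precision" concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. The function names and the single shared scale are assumptions made for this example; real toolkits typically choose scales per layer or per channel and run the integer arithmetic in optimized kernels.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [round(v / scale) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the quantized integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.03, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding moves each value by at most half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)  # [50, -127, 3, 100]
print(max_error <= scale / 2 + 1e-12)  # True
```

Each weight now fits in one byte instead of four, at the cost of a small, bounded rounding error. This is the speed-versus-accuracy tradeoff in miniature.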
