For retry and fallback logic in AI systems, the key metrics are success rate and latency. Success rate shows how often a request ultimately succeeds, whether on the first try, after a retry, or through a fallback. Latency measures how long the user waits for a response, including any delay added by retries or fallback steps. We want a high success rate to keep the system reliable, but also low latency so users don't wait too long. Balancing these metrics helps ensure the system is both dependable and fast.
Retry and fallback logic in Agentic AI - Model Metrics & Evaluation
Retry/Fallback Outcome Matrix:
| Outcome | Count |
|-----------------------------|-------|
| Success First Try (No Retry) | 800 |
| Success After Retry | 150 |
| Success After Fallback | 30 |
| Failure After All Attempts | 20 |
Total Requests = 1000
Success Rate = (800 + 150 + 30) / 1000 = 0.98 (98%)
Failure Rate = 20 / 1000 = 0.02 (2%)
Average Latency = (800 * 1s + 150 * 3s + 30 * 5s + 20 * 5s) / 1000 = 1.5 seconds
Explanation:
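- 1s = normal response time (no retry needed)
- 3s = response time including the retry delay
- 5s = response time including the retry and fallback delay; requests that fail after all attempts are also counted at 5s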
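The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the dictionary layout and the function name `summarize` are illustrative assumptions, while the counts and per-outcome latencies come from the matrix and the latency formula above.

```python
# Outcome counts and per-request latencies (seconds) from the matrix above.
# The dict layout and the name `summarize` are illustrative, not a standard API.
OUTCOMES = {
    "success_first_try": {"count": 800, "latency_s": 1.0, "success": True},
    "success_after_retry": {"count": 150, "latency_s": 3.0, "success": True},
    "success_after_fallback": {"count": 30, "latency_s": 5.0, "success": True},
    "failure_after_all": {"count": 20, "latency_s": 5.0, "success": False},
}


def summarize(outcomes):
    """Compute total requests, success/failure rate, and count-weighted mean latency."""
    total = sum(o["count"] for o in outcomes.values())
    successes = sum(o["count"] for o in outcomes.values() if o["success"])
    total_latency = sum(o["count"] * o["latency_s"] for o in outcomes.values())
    return {
        "total": total,
        "success_rate": successes / total,
        "failure_rate": (total - successes) / total,
        "avg_latency_s": total_latency / total,
    }


summary = summarize(OUTCOMES)
print(summary)
```

Keeping the counts per outcome, rather than a single success counter, makes it possible to see how much of the success rate depends on retries and fallbacks.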
In retry and fallback logic, the tradeoff is between retry aggressiveness and system responsiveness.
- More retries: Increase success rate (like recall) by catching more failures, but increase latency (slow response).
- Fewer retries: Faster responses but risk more failures (lower success rate).
Example: A voice assistant that retries too much may respond correctly more often but annoy users with delays. If it retries less, it responds faster but may fail more.
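To make the tradeoff concrete, here is a rough sketch under simplifying assumptions that are not from the original text: every attempt fails independently with the same probability `p_fail`, every attempt takes the same time, and there is no backoff delay between attempts. Real failures are often correlated, so treat the numbers as illustrative only. The function name `retry_tradeoff` is hypothetical.

```python
def retry_tradeoff(p_fail, attempt_latency_s, max_attempts):
    """Return (success_rate, expected_latency_s) under the stated assumptions."""
    if not 0 <= p_fail < 1:
        raise ValueError("p_fail must be in [0, 1)")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # The request fails only if every attempt fails.
    success_rate = 1 - p_fail ** max_attempts
    # Attempt k + 1 is made only when the first k attempts all failed.
    expected_attempts = sum(p_fail ** k for k in range(max_attempts))
    return success_rate, expected_attempts * attempt_latency_s


rows = [
    (n, *retry_tradeoff(p_fail=0.2, attempt_latency_s=1.0, max_attempts=n))
    for n in (1, 2, 3, 5)
]
for n, rate, latency in rows:
    print(f"max_attempts={n}: success_rate={rate:.3f}, expected_latency={latency:.2f}s")
```

In this model each extra attempt adds less success rate than the one before, while the worst-case wait (`max_attempts * attempt_latency_s`) keeps growing linearly. That worst case is what the unlucky user of the voice assistant actually experiences.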
- Good: Success rate > 95%, average latency < 2 seconds. This means most requests succeed quickly.
- Bad: Success rate < 90%, average latency > 5 seconds. Many failures or long waits frustrate users.
- Warning: Success rate near 100% but latency very high (>10 seconds) means retries/fallbacks work but slow the system too much.
- Ignoring latency: High success rate alone can hide poor user experience if retries cause long delays.
- Data leakage: Using future information to decide retries can inflate success rate unrealistically.
- Overfitting retry logic: Tuning retries only on test data may fail in real-world diverse failures.
- Counting partial successes: Treating fallback partial results as full success can mislead metrics.
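The last pitfall can be checked by reporting two success rates side by side. The sketch below reuses the counts from the outcome matrix and assumes, purely for illustration, that fallback responses are degraded or partial results; the source does not say whether they are.

```python
# Counts from the outcome matrix; the key names are illustrative.
counts = {"first_try": 800, "after_retry": 150, "after_fallback": 30, "failed": 20}
total = sum(counts.values())

# Inclusive: any response counts as a success, including fallback results.
inclusive_rate = (
    counts["first_try"] + counts["after_retry"] + counts["after_fallback"]
) / total
# Strict: only responses from the primary path count as full successes.
strict_rate = (counts["first_try"] + counts["after_retry"]) / total
# Share of traffic served by the fallback path.
fallback_share = counts["after_fallback"] / total

print(
    f"inclusive={inclusive_rate:.2%} "
    f"strict={strict_rate:.2%} "
    f"fallback_share={fallback_share:.2%}"
)
```

If the two rates diverge noticeably, the headline success rate is leaning on fallback results and should be read together with the fallback share.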
Your AI system has a 98% success rate but an average latency of 8 seconds due to many retries and fallbacks. Is this good for production? Why or why not?
Answer: No, because although the system succeeds often, the high latency means users wait too long. This hurts user experience and may cause frustration. You should reduce retries or optimize fallback to lower latency while keeping success rate high.
Practice
What is the main purpose of retry logic in an AI system?
Solution
Step 1: Understand retry logic concept
Retry logic means trying the same task again if it fails temporarily, like redialing a phone call when the line is busy.
Step 2: Match retry logic to options
Only the option "To try a task multiple times to handle temporary failures" describes repeated attempts at the same task, which is what retry logic does.
Final Answer:
To try a task multiple times to handle temporary failures -> Option D
Quick Check:
Retry logic = multiple attempts [OK]
- Confusing retry with fallback
- Thinking retry stops after one failure
- Assuming retry changes the task
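As an illustration of the idea, here is a minimal retry helper. It is a sketch, not a reference implementation: `TransientError`, `retry`, and the exponential backoff schedule are assumptions introduced for this example. Note that the same task is called again unchanged, and only failures marked as temporary are retried.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for temporary failures that are worth retrying."""


def retry(task, max_attempts=3, base_delay_s=0.0, sleep=time.sleep):
    """Call `task` until it succeeds or `max_attempts` is exhausted."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return task()
        except TransientError as err:
            last_error = err
            # Wait before the next attempt, doubling the delay each time.
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    # Every attempt failed: surface the last error instead of hiding it.
    raise last_error


calls = {"n": 0}


def flaky():
    """Fails twice with a temporary error, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("line busy")
    return "connected"


result = retry(flaky)
print(result, calls["n"])  # connected 3
```

Retry differs from fallback in that it never switches to a different task or data source; it only repeats the original one.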
Which of the following is the correct Python syntax to retry a function fetch_data() up to 3 times?
    for _ in range(3):
        try:
            fetch_data()
            break
        except Exception:
            pass

Solution
Step 1: Check syntax for retry loop
The code uses a for loop to make up to 3 attempts, with try-except to catch errors and break to stop as soon as a call succeeds.
Step 2: Identify correct syntax
The option that places try-except inside the for loop over range(3) and calls break on success is valid Python and matches the requirement.
Final Answer:
The for loop over range(3) with try-except inside and break on success -> Option A
Quick Check:
Correct retry loop syntax = for loop over range(3) + try-except inside + break on success [OK]
- Missing try-except block
- Incorrect loop syntax
- Using 'except' without 'try'
Consider this code snippet implementing retry and fallback logic:
    def get_data():
        for _ in range(2):
            try:
                return fetch_from_primary()
            except Exception:
                pass
        return fetch_from_backup()

If fetch_from_primary() fails both times, what will get_data() return?
Solution
Step 1: Analyze retry attempts
The function tries fetch_from_primary() twice inside the loop, catching exceptions and continuing if it fails.
Step 2: Understand fallback behavior
If both attempts fail, the function calls fetch_from_backup() and returns its result as a fallback.
Final Answer:
The result of fetch_from_backup() -> Option B
Quick Check:
Retries fail -> fallback used = The result of fetch_from_backup() [OK]
- Assuming primary always returns result
- Ignoring fallback call
- Thinking exception propagates
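The answer can be verified by running the snippet with stand-in functions. In this sketch the bodies of `fetch_from_primary()` and `fetch_from_backup()` are hypothetical stubs: the primary always fails and the backup returns a fixed string.

```python
calls = {"primary": 0}


def fetch_from_primary():
    """Stub: counts the call, then always fails."""
    calls["primary"] += 1
    raise ConnectionError("primary unavailable")


def fetch_from_backup():
    """Stub: always succeeds."""
    return "backup data"


def get_data():
    for _ in range(2):
        try:
            return fetch_from_primary()
        except Exception:
            pass  # swallow the error and try again
    return fetch_from_backup()


result = get_data()
print(result, calls["primary"])  # backup data 2
```

The primary is called exactly twice, neither exception propagates, and the backup result is returned.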
Identify the bug in this retry and fallback code snippet:
    def get_info():
        for i in range(3):
            try:
                return fetch_data()
            except:
                continue
        return fallback_data()

Solution
Step 1: Review exception handling
The except block catches all exceptions without specifying the exception type. A bare except also catches KeyboardInterrupt and SystemExit, which is bad practice and can hide bugs.
Step 2: Identify best practice
It's better to catch specific exceptions to avoid masking unexpected errors.
Final Answer:
The except block catches all exceptions without specifying type -> Option A
Quick Check:
Catch specific exceptions, not all [OK]
- Using bare except blocks
- Ignoring exception types
- Assuming unused variables cause bugs
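One possible fix is sketched below. Which exceptions count as retryable depends on the real `fetch_data()`; `ConnectionError` and `TimeoutError` are assumptions chosen for illustration, and the functions are passed in as parameters only to keep the example self-contained.

```python
# Assumed set of temporary failures; adjust to what fetch_data() really raises.
RETRYABLE = (ConnectionError, TimeoutError)


def get_info(fetch_data, fallback_data, attempts=3):
    for _ in range(attempts):
        try:
            return fetch_data()
        except RETRYABLE:
            continue  # temporary failure: try again
    return fallback_data()


def flaky_fetch():
    """Stub: always fails with a retryable error."""
    raise TimeoutError("upstream timed out")


def buggy_fetch():
    """Stub: fails with a programming error that should not be retried."""
    return {}["missing"]


info = get_info(flaky_fetch, lambda: "cached info")
print(info)  # cached info

try:
    get_info(buggy_fetch, lambda: "cached info")
    bug_surfaced = False
except KeyError:
    bug_surfaced = True
print(bug_surfaced)  # True
```

With specific exception types, temporary failures still lead to the fallback, while the KeyError from the buggy function surfaces immediately instead of being silently retried.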
You want to design an AI agent that tries to fetch user data from a primary server up to 3 times. If all retries fail, it should fetch from a backup server. Which code snippet correctly implements this retry and fallback logic?
Option A:

    for _ in range(3):
        try:
            data = fetch_primary()
        except:
            data = fetch_backup()
            break

Option B:

    for _ in range(3):
        try:
            data = fetch_primary()
            break
        except:
            pass
    else:
        data = fetch_backup()

Option C:

    try:
        data = fetch_primary()
    except:
        data = fetch_backup()

Option D:

    while True:
        try:
            data = fetch_primary()
            break
        except:
            data = fetch_backup()
            break

Solution
Step 1: Understand retry and fallback requirements
The agent must try fetching from the primary server up to 3 times, and fall back only if all attempts fail.
Step 2: Analyze each option's behavior
Option B uses a for loop with try-except and an else clause on the loop. The else clause runs only if the loop finishes without hitting break, that is, when all 3 attempts failed, so the fallback runs exactly when required. Option A falls back on the first failure, Option C never retries, and Option D falls back after a single failed attempt.
Final Answer:
Retries primary 3 times, then fallback if all fail -> Option B
Quick Check:
Retry 3 times + fallback after = Retries primary 3 times, then fallback if all fail [OK]
- Running fallback too early
- Not retrying enough times
- Missing else clause for fallback
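The for-else pattern can be exercised with stand-in functions. In this sketch `fetch_user_data` and `make_primary` are hypothetical helpers, the fetch functions are stubs, and `except Exception` replaces the bare `except` from the quiz options so that KeyboardInterrupt is not swallowed.

```python
def fetch_user_data(fetch_primary, fetch_backup, max_attempts=3):
    for _ in range(max_attempts):
        try:
            data = fetch_primary()
            break  # success: skip the else clause
        except Exception:
            pass  # failure: move on to the next attempt
    else:
        # Runs only when the loop ended without break (every attempt failed).
        data = fetch_backup()
    return data


def make_primary(failures_before_success):
    """Build a stub primary that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def fetch_primary():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise ConnectionError("primary unavailable")
        return "primary data"

    return fetch_primary, state


def fetch_backup():
    return "backup data"


primary, state = make_primary(failures_before_success=1)
recovered = fetch_user_data(primary, fetch_backup)
print(recovered, state["calls"])  # primary data 2

primary_down, down_state = make_primary(failures_before_success=99)
fallback = fetch_user_data(primary_down, fetch_backup)
print(fallback, down_state["calls"])  # backup data 3
```

The first run recovers on the second attempt and never touches the backup; the second run exhausts all 3 attempts before the else clause triggers the fallback.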
