For chains that combine multiple models or steps, the key metric is overall accuracy or task success rate, which shows how well the entire chain completes the goal. For router chains, routing accuracy also matters: it checks whether the right model is chosen for each input. We care about both because one weak step or one wrong routing decision is enough to spoil the final output.
Chains (sequential, router) in Prompt Engineering / GenAI - Model Metrics & Evaluation
| Actual Model Needed | Router Chose Model A | Router Chose Model B |
|---------------------|----------------------|----------------------|
| Model A | TP_A (correct) | FP_B (wrongly sent to B) |
| Model B | FP_A (wrongly sent to A) | TP_B (correct) |

Total samples = TP_A + FP_A + TP_B + FP_B
Precision for Model A = TP_A / (TP_A + FP_A)
Recall for Model A = TP_A / (TP_A + FN_A), where FN_A = FP_B (inputs that needed Model A but were sent to Model B)
This matrix helps measure if the router picks the right model for each input.
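As a rough illustration of the formulas above, the sketch below counts routing outcomes from two parallel lists and derives per-model precision and recall. The function name `routing_metrics` and the sample labels are hypothetical, made up for this example; they are not part of any particular chain library.

```python
from collections import Counter

def routing_metrics(actual, predicted):
    """Return per-model counts, precision and recall for a router's choices.

    actual:    the model each input really needed
    predicted: the model the router actually chose
    """
    pairs = Counter(zip(actual, predicted))
    labels = sorted(set(actual) | set(predicted))
    metrics = {}
    for label in labels:
        tp = pairs[(label, label)]
        # Router chose this model although another one was needed.
        fp = sum(n for (a, p), n in pairs.items() if p == label and a != label)
        # Input needed this model but the router chose another one.
        fn = sum(n for (a, p), n in pairs.items() if a == label and p != label)
        metrics[label] = {
            "tp": tp,
            "fp": fp,
            "fn": fn,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics

# Made-up evaluation data: eight inputs, two candidate models.
actual = ["A", "A", "A", "B", "B", "B", "B", "A"]
predicted = ["A", "A", "B", "B", "B", "A", "B", "A"]
result = routing_metrics(actual, predicted)
print(result["A"])
```

With only two models, the false negatives of Model A are exactly the false positives of Model B, which is why the table needs just four cells.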
If the router has high precision but low recall for a model, it means it rarely picks that model wrongly but often misses inputs that need it. This can cause poor results if some inputs never reach the best model.
If recall is high but precision is low, the router catches most inputs that need the model but also sends it many inputs that do not, causing unnecessary processing or errors.
For sequential chains, the main tradeoff is between speed and accuracy: adding more steps can improve accuracy but slows down the chain.
- Good: Overall accuracy above 90%, router precision and recall above 85%, smooth step transitions without errors.
- Bad: Overall accuracy below 70%, router precision or recall below 50%, frequent step failures or wrong routing causing wrong outputs.
- Ignoring step errors: A chain may have good final accuracy but some steps fail silently, causing hidden issues.
- Data leakage: Training router or steps on overlapping data can inflate metrics falsely.
- Overfitting: Router or steps tuned too much on training data may fail on new inputs.
- Accuracy paradox: High accuracy can hide poor performance on rare but important cases.
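One way to catch the silent step failures mentioned above is to validate each step's output and track a pass rate per step, not only the final result. The sketch below assumes each step comes with a validator function; `evaluate_steps`, the toy steps, and the sample inputs are all hypothetical stand-ins for real model calls.

```python
from collections import defaultdict

def evaluate_steps(steps, inputs):
    """Return the pass rate of each step in a sequential chain.

    steps:  list of (name, run, is_valid) tuples, executed in order
    inputs: iterable of raw inputs fed to the first step
    """
    passed = defaultdict(int)
    seen = defaultdict(int)
    for value in inputs:
        for name, run, is_valid in steps:
            value = run(value)
            seen[name] += 1
            if is_valid(value):
                passed[name] += 1
            else:
                break  # later steps never see this input
    return {name: passed[name] / seen[name] for name in seen}

# Toy steps standing in for real model calls (assumptions for illustration).
steps = [
    ("normalize", lambda s: s.strip().lower(), lambda out: len(out) > 0),
    ("shorten", lambda s: s[:5], lambda out: out.isalpha()),
]
inputs = ["  Hello world ", "   ", "12345 abc", "Chain"]
rates = evaluate_steps(steps, inputs)
print(rates)
```

A step with a much lower pass rate than the others is a candidate for the weakest link, even when the final accuracy still looks acceptable.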
Your router chain has 98% overall accuracy but only 12% recall for a critical model in the chain. Is it good for production? Why or why not?
Answer: No, it is not good. A recall of 12% means the router misses 88% of the inputs that need the critical model, so those inputs are handled by the wrong model. The 98% overall accuracy hides this only because such inputs are a small share of the total, which is the accuracy paradox in action.
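To see how both numbers can be true at once, here is a small arithmetic sketch. The counts are invented purely to reproduce the 98% and 12% figures from the question, and routing accuracy is used as a stand-in for overall accuracy.

```python
# Hypothetical counts chosen only to reproduce the numbers in the question.
total_inputs = 10_000
critical_inputs = 200        # inputs that need the critical model
critical_routed_ok = 24      # of those, how many the router sent correctly
other_routed_ok = 9_776      # remaining inputs that were routed correctly

accuracy = (critical_routed_ok + other_routed_ok) / total_inputs
critical_recall = critical_routed_ok / critical_inputs
missed_critical = critical_inputs - critical_routed_ok

print(f"accuracy={accuracy:.2%}")
print(f"critical_recall={critical_recall:.2%}")
print(f"missed_critical={missed_critical}")
```

Under these assumed counts, 176 of the 200 critical inputs never reach the model they need, yet they make up less than 2% of all inputs, so the headline accuracy barely moves.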