ReAct pattern in Prompt Engineering / GenAI - Model Metrics & Evaluation

The ReAct pattern combines reasoning and acting steps in AI models to improve decision-making. To evaluate it, we focus on accuracy and task success rate. Accuracy shows how often the model's final answers are correct. Task success rate measures whether the model completes the intended task through its reasoning and actions. These metrics matter because ReAct aims to improve both understanding and execution, so we want to see whether the model reasons well and acts correctly.
Confusion matrix for ReAct model task completion:

                   Predicted Success   Predicted Failure
Actual Success     85 (TP)             15 (FN)
Actual Failure     10 (FP)             90 (TN)

Total samples = 200
Precision = TP / (TP + FP) = 85 / (85 + 10) = 0.8947
Recall = TP / (TP + FN) = 85 / (85 + 15) = 0.85
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) = 0.871
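The calculations above can be reproduced directly from the four confusion-matrix counts. A minimal sketch (variable names are illustrative):

```python
# Recompute the metrics from the confusion-matrix counts in the table above.
tp, fn, fp, tn = 85, 15, 10, 90

precision = tp / (tp + fp)                          # 85 / 95  ≈ 0.8947
recall = tp / (tp + fn)                             # 85 / 100 = 0.85
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.8718
accuracy = (tp + tn) / (tp + fn + fp + tn)          # 175 / 200 = 0.875

print(f"precision={precision:.4f} recall={recall:.4f} "
      f"f1={f1:.4f} accuracy={accuracy:.4f}")
```

Note that accuracy (0.875) follows from the same table even though it is not listed with the other formulas.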
This matrix shows how well the ReAct model predicts successful task completion. High precision means most predicted successes are true. High recall means most actual successes are caught.
In ReAct models, precision and recall balance is key:
- High Precision: The model rarely claims success unless it is very sure. Good when a false claim of success is costly, as in medical advice generation.
- High Recall: The model tries to catch every actual success, even at the cost of some false positives. Useful when missing a success is worse, as in emergency response planning.
Choosing which to prioritize depends on the task. For example, a ReAct model helping with legal advice should have high precision to avoid wrong guidance. A ReAct model for search and rescue should have high recall to not miss any possible success.
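One common way to realize this tradeoff is a confidence threshold on the model's success score: a strict threshold favors precision, a lenient one favors recall. A sketch with made-up scores and labels (not real ReAct outputs):

```python
# Illustrative sketch: trading precision against recall via a decision threshold.
def precision_recall(scores, labels, threshold):
    """Classify score >= threshold as 'predicted success', then compute both metrics."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1]
labels = [1,    1,   0,   1,   1,   0,    1,   0,   0,   0]

# Strict threshold (legal-advice style): fewer success claims, higher precision.
print(precision_recall(scores, labels, 0.85))
# Lenient threshold (search-and-rescue style): more success claims, higher recall.
print(precision_recall(scores, labels, 0.35))
```

The same model can sit at either operating point; the choice of threshold encodes which error type the application tolerates.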
Good metrics:
- Accuracy above 85%
- Precision and recall both above 80%
- F1 score close to or above 85%
- Consistent task success rate across different inputs
Bad metrics:
- Accuracy below 70%
- Precision or recall below 50%
- Large gap between precision and recall (e.g., precision 90% but recall 30%)
- Unstable task success rate, failing often on new inputs
Good metrics mean the ReAct model reasons and acts reliably. Bad metrics show it struggles to balance reasoning and action, leading to wrong or missed results.
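The good/bad thresholds above can be encoded as a simple evaluation gate. This is a sketch; the function name is an assumption, and the numeric bars are taken directly from the checklists:

```python
# Hedged sketch: turn the rule-of-thumb thresholds into a single check.
def metrics_look_healthy(accuracy, precision, recall, f1):
    """Return True when metrics clear the 'good' bars listed above."""
    gap = abs(precision - recall)
    return (
        accuracy > 0.85
        and precision > 0.80
        and recall > 0.80
        and f1 >= 0.85          # "close to or above 85%" simplified to >= 0.85
        and gap < 0.20          # guard against a large precision/recall gap
    )

print(metrics_look_healthy(0.875, 0.8947, 0.85, 0.8718))  # confusion matrix above
print(metrics_look_healthy(0.98, 0.90, 0.30, 0.45))       # large gap, low recall
```

The first call (the example confusion matrix) passes; the second fails on recall and the precision/recall gap.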
Common pitfalls:
- Accuracy paradox: High accuracy can be misleading if data is imbalanced (e.g., mostly failures). Always check precision and recall.
- Data leakage: If the model sees answers during training, metrics will be unrealistically high.
- Overfitting: The model performs well on training tasks but poorly on new ones; the problem is masked by high training accuracy.
- Ignoring task complexity: Metrics alone don't show if reasoning steps are meaningful or just memorized.
- Not measuring intermediate reasoning quality: Only final output metrics miss how well the model reasons before acting.
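The accuracy paradox from the first pitfall is easy to demonstrate with invented, deliberately imbalanced numbers:

```python
# Sketch of the accuracy paradox: on imbalanced data, a model that always
# predicts "failure" looks accurate while being useless.
labels = [1] * 5 + [0] * 95          # 5 actual successes out of 100 tasks
predictions = [0] * 100              # degenerate model: never predicts success

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # high accuracy, zero recall
```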
Question: Your ReAct model has 98% accuracy but only 12% recall on successful task completions. Is it good for production? Why or why not?
Answer: No, it is not good. The very low recall means the model misses most actual successes, even if it rarely makes false success claims. This means many tasks that should succeed are not recognized, which can be critical depending on the application. High accuracy alone is misleading here.
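To make the scenario concrete, here is one set of counts (assumed, for illustration) that produces exactly 98% accuracy with 12% recall:

```python
# Worked check: a heavily imbalanced confusion matrix consistent with the question.
tp, fn, fp, tn = 24, 176, 24, 9776   # 200 actual successes among 10,000 tasks

accuracy = (tp + tn) / (tp + fn + fp + tn)   # 9800 / 10000 = 0.98
recall = tp / (tp + fn)                      # 24 / 200 = 0.12
precision = tp / (tp + fp)                   # 24 / 48 = 0.50

print(accuracy, recall, precision)
# Accuracy is driven almost entirely by the many true negatives;
# 176 of the 200 real successes are missed.
```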