For autonomous web browsing agents, key metrics include task success rate, precision, and recall. Task success rate measures how often the agent completes the intended browsing task correctly, such as finding information or filling forms. Precision tells us how many of the agent's actions were correct out of all actions it took, avoiding unnecessary or wrong clicks. Recall shows how many needed actions the agent actually performed, ensuring it does not miss important steps. These metrics matter because the agent must be both accurate and thorough to be useful and safe.
Autonomous web browsing agents in Agentic AI - Model Metrics & Evaluation
Confusion Matrix for Agent Actions
+-------------------+----------------+------------------+
|                   | Action Taken   | Action Not Taken |
+-------------------+----------------+------------------+
| Action Needed     | True Positive  | False Negative   |
| Action Not Needed | False Positive | True Negative    |
+-------------------+----------------+------------------+
Where:
- TP: Agent correctly performed a needed action.
- FP: Agent performed an unnecessary or wrong action.
- FN: Agent missed a needed action.
- TN: Agent correctly avoided unnecessary actions.
Metrics:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Task Success Rate = Number of tasks completed correctly / Total tasks
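The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a standard evaluation harness: it assumes each task has a known set of needed actions and a logged set of performed actions, and the action names (`open_form`, `click_ad`, and so on) and helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ActionCounts:
    tp: int  # needed actions the agent performed
    fp: int  # performed actions that were not needed
    fn: int  # needed actions the agent missed


def count_actions(needed: set[str], performed: set[str]) -> ActionCounts:
    """Compare the needed actions with the actions the agent actually took."""
    return ActionCounts(
        tp=len(needed & performed),
        fp=len(performed - needed),
        fn=len(needed - performed),
    )


def precision(c: ActionCounts) -> float:
    # Precision = TP / (TP + FP); defined as 0.0 when the agent took no actions.
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: ActionCounts) -> float:
    # Recall = TP / (TP + FN); defined as 0.0 when no actions were needed.
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def task_success_rate(outcomes: list[bool]) -> float:
    # Fraction of tasks completed correctly.
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical form-filling task: the agent skips one field and clicks an ad.
needed = {"open_form", "fill_name", "fill_email", "click_submit"}
performed = {"open_form", "fill_name", "click_submit", "click_ad"}
counts = count_actions(needed, performed)

print(counts)
print(f"precision={precision(counts):.2f} recall={recall(counts):.2f}")
print(f"success rate={task_success_rate([True, False, True, True]):.2f}")
```

Treating actions as sets ignores ordering and repeated clicks. A real evaluation would likely need a stricter matching rule, which is outside the scope of this sketch.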
If the agent has high precision but low recall, it means it rarely makes wrong moves but often misses important steps. For example, it clicks only when very sure but may skip filling some form fields, causing incomplete tasks.
If the agent has high recall but low precision, it tries to do all needed actions but also many wrong ones. For example, it clicks many buttons, including irrelevant ones, which may cause errors or slow performance.
Balancing precision and recall is important: the agent should do all necessary actions (high recall) but avoid mistakes (high precision) to complete tasks efficiently and correctly.
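To make the tradeoff concrete, the sketch below compares three hypothetical agents: a cautious one, an eager one, and a balanced one. The counts are invented for illustration. The F1 score (the harmonic mean of precision and recall) is not one of the metrics named above; it is included here only as one common way to summarize the balance in a single number.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw action counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Hypothetical counts, each agent facing 10 needed actions.
agents = {
    "cautious": precision_recall_f1(tp=4, fp=0, fn=6),   # clicks only when sure
    "eager": precision_recall_f1(tp=10, fp=15, fn=0),    # clicks nearly everything
    "balanced": precision_recall_f1(tp=9, fp=1, fn=1),
}

for name, (p, r, f1) in agents.items():
    print(f"{name:9s} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With these numbers the cautious and eager agents end up with the same F1 score, well below the balanced agent, even though each is perfect on one of the two metrics.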
- Good: Task success rate above 90%, with precision and recall both above 85%. The agent completes tasks reliably with few mistakes or missed steps.
- Bad: Task success rate below 60%, or precision or recall below 50%. The agent often fails tasks, clicks wrong elements, or misses important actions.
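The good and bad ranges above could be encoded as a simple check, for example as a gate in an evaluation script. This is a sketch under two assumptions that the text does not state: any single metric in the bad range is enough to rate the agent "bad", and anything between the two ranges is labeled "borderline". The function name `rate_agent` is hypothetical.

```python
def rate_agent(success_rate: float, precision: float, recall: float) -> str:
    """Classify an agent using the thresholds listed above (values in 0..1)."""
    # Good: success rate above 90% and both precision and recall above 85%.
    if success_rate > 0.90 and precision > 0.85 and recall > 0.85:
        return "good"
    # Bad (assumed rule): success rate below 60%, or precision or recall below 50%.
    if success_rate < 0.60 or precision < 0.50 or recall < 0.50:
        return "bad"
    # Assumed label for everything between the two ranges.
    return "borderline"


print(rate_agent(0.93, 0.90, 0.88))
print(rate_agent(0.95, 0.90, 0.40))
print(rate_agent(0.80, 0.80, 0.80))
```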
- Accuracy paradox: High overall accuracy can be misleading. Most possible actions on a page are not needed, so an agent that mostly does nothing collects many true negatives and few errors while never completing tasks.
- Data leakage: Training the agent on the same websites used for testing inflates its metrics, and it may then fail in real browsing scenarios.
- Overfitting: Agent performs well on known sites but poorly on new or dynamic pages.
- Ignoring user experience: Metrics may not capture delays or confusing agent behavior that frustrates users.
Your autonomous web browsing agent has 98% accuracy but only 12% recall on needed actions. Is it good for production? Why or why not?
Answer: No, it is not good. The 98% accuracy comes mostly from true negatives: the agent correctly skips the many actions that are not needed. The 12% recall means it misses 88% of the needed actions, so it rarely completes a task and is unreliable despite the high accuracy.
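A worked example shows how 98% accuracy and 12% recall can occur together. The counts below are hypothetical, chosen only to reproduce those two figures, and accuracy is taken as (TP + TN) / (TP + FP + FN + TN).

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    # Share of all decisions (act or skip) that were correct.
    return (tp + tn) / (tp + fp + fn + tn)


def recall(tp: int, fn: int) -> float:
    # Share of needed actions the agent actually performed.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical run: 5000 candidate actions, only 100 of them needed.
tp, fn = 12, 88      # needed actions: 12 performed, 88 missed
fp, tn = 12, 4888    # unneeded actions: 12 performed, 4888 correctly skipped

print(f"accuracy={accuracy(tp, fp, fn, tn):.2%}")  # accuracy=98.00%
print(f"recall={recall(tp, fn):.2%}")              # recall=12.00%

# An agent that does nothing at all scores the same accuracy on this data.
print(f"idle accuracy={accuracy(0, 0, 100, 4900):.2%}")  # idle accuracy=98.00%
print(f"idle recall={recall(0, 100):.2%}")               # idle recall=0.00%
```

Because 4900 of the 5000 candidate actions are not needed, skipping them dominates the accuracy figure. Recall is the metric that exposes the missed steps.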
