Code Generation Agent Design in Agentic AI - Model Metrics & Evaluation

For code generation agents, the key metrics are accuracy of the generated code, text-similarity metrics such as the BLEU score, and functional correctness. Accuracy here means how often the generated code matches the expected solution or passes its tests. The BLEU score measures how close the generated code is to reference code in wording and structure. Functional correctness is the most important of the three, because code must run correctly, not merely look similar to a reference.
Unlike classification, code generation does not use a confusion matrix. Instead, we can think in terms of pass/fail on test cases:
Test Cases: 100
Passed: 85
Failed: 15
Pass Rate = 85/100 = 85%
Fail Rate = 15/100 = 15%
This pass/fail count acts like a simple confusion matrix for code correctness.
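The pass/fail tally above can be sketched as a small helper. This is a minimal illustration, and the `results` list of booleans (one entry per test case) is a hypothetical representation of test outcomes:

```python
# Minimal sketch: summarizing test-case outcomes as pass/fail counts.
# `results` is a hypothetical list of booleans, one per test case.
def pass_fail_summary(results):
    passed = sum(results)
    failed = len(results) - passed
    total = len(results)
    return {
        "test_cases": total,
        "passed": passed,
        "failed": failed,
        "pass_rate": passed / total,
        "fail_rate": failed / total,
    }

# 85 passes and 15 failures, matching the example above
results = [True] * 85 + [False] * 15
summary = pass_fail_summary(results)
print(summary["pass_rate"])  # 0.85
print(summary["fail_rate"])  # 0.15
```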
In code generation, the analogous tradeoff is between generating code that compiles and runs correctly (precision) and covering all requested features or requirements (recall).
If the agent generates very safe but minimal code, it has high precision (few errors) but low recall (misses features). If it tries to generate complex code covering all features, it may have higher recall but lower precision due to bugs.
Example: A code agent that always generates a simple "Hello World" program has perfect precision but poor recall for complex tasks. One that tries to generate full apps may fail often, lowering precision.
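One way to make this analogy concrete is to score a single task against feature sets. This is a hedged sketch: `requested`, `generated`, and `working` are hypothetical sets of feature names, not part of any standard metric:

```python
# Sketch of the precision/recall analogy for code generation.
# `requested`: features the task asked for; `generated`: features the
# agent's code attempts; `working`: features that actually pass tests.
def feature_precision_recall(requested, generated, working):
    # Precision: of the features the agent generated, how many work.
    precision = len(working & generated) / len(generated) if generated else 0.0
    # Recall: of the features requested, how many work in the output.
    recall = len(working & requested) / len(requested) if requested else 0.0
    return precision, recall

# A "safe but minimal" agent: one bug-free feature out of four requested,
# giving perfect precision but low recall.
p, r = feature_precision_recall(
    requested={"login", "search", "export", "admin"},
    generated={"login"},
    working={"login"},
)
print(p, r)  # 1.0 0.25
```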
Good: pass rate above 90%, BLEU score above 0.7 (on a 0-1 scale), and generated code that passes all functional tests.
Bad: pass rate below 50%, BLEU score below 0.3, or generated code that frequently fails to compile or run.
Good code generation means the agent reliably produces working code that meets requirements. Bad means frequent errors or incomplete solutions.
- Overfitting: Agent memorizes training code but fails on new tasks.
- Data Leakage: Test code appears in training data, inflating pass rates.
- Accuracy Paradox: High BLEU score but generated code does not run correctly.
- Ignoring Functional Tests: Relying only on text similarity without running code.
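The accuracy-paradox pitfall can be demonstrated with a toy example. The crude token-overlap score below is a hypothetical stand-in for BLEU (not the real BLEU formula), used only to show that text similarity and functional correctness can disagree:

```python
# Illustration of the "accuracy paradox" pitfall: near-identical text,
# yet functionally wrong code. `token_overlap` is a crude stand-in for
# a text-similarity metric like BLEU.
def token_overlap(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    matches = sum(1 for tok in cand if tok in ref)
    return matches / len(cand) if cand else 0.0

def passes_functional_test(code, expected):
    # Functional check: actually execute the code and test its behavior.
    try:
        scope = {}
        exec(code, scope)
        return scope["add"](2, 3) == expected
    except Exception:
        return False

reference = "def add ( a , b ) : return a + b"
candidate = "def add ( a , b ) : return a - b"  # one-token bug: '-' vs '+'

print(round(token_overlap(candidate, reference), 2))  # 0.92 - very similar text
print(passes_functional_test("def add(a, b): return a - b", 5))  # False - broken code
```

High text similarity here hides a subtraction bug that any functional test catches immediately, which is why running the code matters more than comparing it to a reference.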
Your code generation agent has 98% accuracy on training data but only 12% pass rate on new test cases. Is it good for production? Why or why not?
Answer: No, it is not. The gap between 98% training accuracy and a 12% test pass rate points to overfitting: the agent has memorized training code but cannot generate correct solutions for new tasks. Poor generalization of this kind makes it unsuitable for production.
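A simple automated check for this failure mode is to flag any large gap between training accuracy and held-out pass rate. The 0.2 threshold below is an illustrative assumption, not a standard value:

```python
# Sketch: flagging a likely-overfit agent from the generalization gap
# between training accuracy and held-out pass rate.
def looks_overfit(train_accuracy, test_pass_rate, max_gap=0.2):
    # max_gap is an illustrative threshold; tune it for your setting.
    return (train_accuracy - test_pass_rate) > max_gap

print(looks_overfit(0.98, 0.12))  # True: an 86-point gap signals overfitting
print(looks_overfit(0.92, 0.88))  # False: small gap, generalizes well
```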
