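For code generation models, the main goal is to produce correct and useful code. BLEU measures n-gram overlap between the generated code and reference code, and CodeBLEU extends it with syntax-tree and data-flow matching. Both still score similarity to a reference rather than behavior: a program that differs from the reference by a single operator can score highly while being wrong, and a correct program written in a different style can score poorly.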
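To make the gap between similarity and correctness concrete, here is a minimal sketch. It does not compute BLEU or CodeBLEU; it uses the character-level ratio from Python's standard `difflib` as a stand-in similarity score, and the `add` function, the two candidates, and the two test cases are all made up for illustration.

```python
import difflib

# Hypothetical reference solution and two hypothetical model outputs.
reference = "def add(a, b):\n    return a + b\n"
candidate_wrong = "def add(a, b):\n    return a - b\n"  # one character off, but wrong
candidate_right = "def add(x, y):\n    total = x + y\n    return total\n"  # correct, different style


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for BLEU-style metrics."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def passes_tests(source: str) -> bool:
    """Execute the candidate and check it against two illustrative test cases."""
    namespace = {}
    exec(source, namespace)  # acceptable for a toy example; sandbox real model output
    add = namespace["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0


for name, cand in [("wrong", candidate_wrong), ("right", candidate_right)]:
    print(name, round(similarity(reference, cand), 3), passes_tests(cand))
```

The near-copy with the wrong operator gets the higher similarity score but fails the tests, while the restructured candidate scores lower and passes.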
Functional correctness is therefore the key criterion: the generated code must run without errors and produce the expected results on a set of test cases. The usual metric is pass@k, the probability that at least one of k samples generated for a problem passes all of that problem's tests, averaged over the problems in the benchmark.
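In practice pass@k is usually estimated rather than measured by drawing exactly k samples. A common approach, introduced with the HumanEval benchmark, is to generate n >= k samples per problem, count the c samples that pass, and compute 1 - C(n-c, k) / C(n, k). The sketch below assumes that setup; the function names and the example counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated, c: samples that passed all tests, k: budget.
    Returns the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(counts: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    return sum(pass_at_k(n, c, k) for n, c in counts) / len(counts)


# Illustrative counts: 10 samples per problem with 3, 0, and 10 passing.
print(mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1))
```

With k = 1 the estimate reduces to the fraction of passing samples, c / n, which is a quick sanity check on any implementation.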
In summary, functional correctness metrics matter most because they show whether the code actually works, not just whether it looks similar to a reference.