Code generation in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Code generation
Which metric matters for Code Generation and WHY

For code generation models, the main goal is to produce correct and useful code. Metrics like BLEU and CodeBLEU measure how closely the generated code matches reference code. However, these only check similarity, not correctness.

Therefore, functional correctness is key: the generated code must run without errors and produce the expected results. A common way to measure this is pass@k, the fraction of problems for which at least one of k generated code snippets passes all tests.

In summary, functional correctness metrics matter most because they show if the code actually works, not just if it looks similar.
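
To make the gap between similarity and correctness concrete, here is a minimal sketch. It uses `difflib.SequenceMatcher` as a simple stand-in for a similarity score (it is not BLEU or CodeBLEU), and the `add` function and its three test cases are hypothetical examples chosen for illustration.

```python
import difflib

reference = "def add(a, b):\n    return a + b\n"
# Nearly identical text, but it subtracts instead of adding.
similar_but_wrong = "def add(a, b):\n    return a - b\n"
# Quite different text, same behavior.
different_but_right = "def add(x, y):\n    total = sum([x, y])\n    return total\n"


def similarity(candidate: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for BLEU, not BLEU itself."""
    return difflib.SequenceMatcher(None, reference, candidate).ratio()


def passes_tests(candidate: str) -> bool:
    """Execute the candidate and check it against a few hypothetical test cases."""
    namespace = {}
    exec(candidate, namespace)  # only safe for trusted snippets like these
    add = namespace["add"]
    return add(2, 3) == 5 and add(-1, 1) == 0 and add(0, 0) == 0


for name, code in [
    ("similar_but_wrong", similar_but_wrong),
    ("different_but_right", different_but_right),
]:
    print(name, round(similarity(code), 2), passes_tests(code))
```

The snippet that looks most like the reference fails the tests, while the one that looks least like it passes, which is the situation a similarity-only metric cannot detect.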

Confusion Matrix or Equivalent Visualization

Code generation is not a classification task, so confusion matrices don't apply directly. Instead, we use pass@k metrics.

pass@1 = Number of problems solved by the first generated code / Total problems
pass@5 = Number of problems solved by any of the 5 generated codes / Total problems
    

Example:

Total problems: 100
Solved by the first snippet: 60, so pass@1 = 60/100 = 60%
Solved by any of 5 snippets: 85, so pass@5 = 85/100 = 85%
    

This shows how often the model generates at least one correct solution among multiple tries.
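
The two ratios above can be computed directly from per-problem test results. The sketch below assumes a hypothetical `results` structure (one list per problem, one boolean per generated snippet, `True` meaning the snippet passed all tests); the data is made up for illustration.

```python
def pass_at_k(results: list[list[bool]], k: int) -> float:
    """Fraction of problems where any of the first k snippets passed all tests."""
    solved = sum(1 for samples in results if any(samples[:k]))
    return solved / len(results)


# Hypothetical results: 4 problems, 5 generated snippets each.
results = [
    [True, False, False, False, False],   # solved by the first snippet
    [False, False, True, False, False],   # solved by the third snippet
    [False, False, False, False, False],  # never solved
    [False, True, True, False, False],    # solved by the second snippet
]

print(pass_at_k(results, 1))  # 0.25
print(pass_at_k(results, 5))  # 0.75
```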

Precision vs Recall Tradeoff (or Equivalent) with Examples

In code generation, precision and recall don't apply the way they do in classification. The closest equivalent is a tradeoff between how many candidate snippets you sample per problem (which rewards diversity) and the cost of generating and testing them.

If the model generates only one snippet, it succeeds only when that single attempt is correct (pass@1). If it generates many varied snippets, the chance that at least one is correct rises (higher pass@k), but the extra snippets will also include more incorrect ones.

Example:

  • Generating 1 snippet: 60% pass rate
  • Generating 10 snippets: 95% pass rate

This shows generating more options increases chances of correctness but costs more computation.
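
One way to see this tradeoff numerically is the combinatorial estimator often used in published code-generation evaluations: sample n snippets per problem, count the c that pass, and estimate pass@k as 1 - C(n-c, k) / C(n, k). This estimator is not described in the text above; it is brought in here as an assumption, and the values `n = 20` and `c = 3` are hypothetical.

```python
from math import comb


def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that a random size-k subset of n samples contains a correct one.

    n: snippets generated for the problem, c: snippets that passed all tests.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


n, c = 20, 3  # hypothetical: 3 of 20 sampled snippets were correct
for k in (1, 5, 10):
    print(k, round(estimate_pass_at_k(n, c, k), 3))
```

For a single problem the estimate grows with k, while the compute bill grows with the number of snippets generated and tested, so the choice of k is a budget decision as much as a quality one.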

What "Good" vs "Bad" Metric Values Look Like for Code Generation

Good:

  • High pass@1 (e.g., > 70%) means the first generated code is often correct.
  • High pass@5 or pass@10 (e.g., > 90%) means the model reliably produces a correct solution within a few tries.
  • Low syntax errors and runtime errors in generated code.

Bad:

  • Low pass@1 (e.g., < 30%) means the model rarely gets it right on the first try.
  • Low pass@k even for large k means the model struggles to generate any correct code.
  • High rate of code that does not compile or crashes.
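
Breaking failures down by type helps tell these cases apart. The following is a minimal sketch, assuming each generated snippet comes with a test written as a Python `assert` statement; the `classify` helper and the sample snippets are hypothetical, and a real harness should run untrusted code in a sandbox with timeouts rather than calling `exec` directly.

```python
def classify(code: str, test: str) -> str:
    """Return 'syntax_error', 'runtime_error', 'wrong_answer', or 'pass'."""
    try:
        # IndentationError is a subclass of SyntaxError, so it is caught here too.
        compiled = compile(code + "\n" + test, "<generated>", "exec")
    except SyntaxError:
        return "syntax_error"
    try:
        exec(compiled, {})  # unsafe for untrusted code; illustration only
    except AssertionError:
        return "wrong_answer"
    except Exception:
        return "runtime_error"
    return "pass"


test = "assert f(1) == 1"
samples = {
    "unindented body": "def f(x):\nreturn x",
    "division by zero": "def f(x):\n    return x / 0",
    "off by one": "def f(x):\n    return x + 1",
    "correct": "def f(x):\n    return x",
}
for name, code in samples.items():
    print(name, "->", classify(code, test))
```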
Common Metrics Pitfalls in Code Generation
  • Relying only on similarity metrics: BLEU or CodeBLEU can be high even if code is incorrect or does not run.
  • Ignoring functional correctness: Code that looks good but fails tests is useless.
  • Overfitting to test cases: Models might memorize solutions instead of generalizing.
  • Data leakage: If test problems appear in training data, metrics will be misleadingly high.
  • Ignoring diversity: Generating only one code snippet can hide the model's ability to find correct solutions among multiple tries.
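
As a rough illustration of the data leakage pitfall, the sketch below flags test problems whose text also appears in the training data after simple normalization. The `normalize` and `find_leaked` helpers and the sample strings are hypothetical, and this check only catches near-verbatim duplicates, not paraphrased or translated problems.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting changes do not hide a match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def find_leaked(test_problems: list[str], training_texts: list[str]) -> list[str]:
    """Return the test problems that also occur in the training texts."""
    seen = {normalize(t) for t in training_texts}
    return [p for p in test_problems if normalize(p) in seen]


training_texts = [
    "Write a function that reverses a string.",
    "Return the sum of   two numbers.",
]
test_problems = [
    "return the sum of two numbers.",           # duplicate up to case and spacing
    "Check whether a number is a palindrome.",  # not in the training texts
]

print(find_leaked(test_problems, training_texts))
```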
Self Check

Your code generation model scores 98 on BLEU (on a 0-100 scale) but reaches only 12% pass@1. Is it good for production? Why or why not?

Answer: No, it is not good. The high BLEU score means the generated code looks similar to reference code, but the very low pass@1 means the code rarely runs correctly on the first try. For production, functional correctness (pass@1) matters more than similarity.

Key Result
Functional correctness metrics like pass@k are key to evaluating code generation quality, as similarity scores alone can be misleading.

Practice

1. What is the main purpose of code generation in AI?
easy
A. Manually write code faster
B. Automatically create code from instructions
C. Run code without errors
D. Delete unnecessary code

Solution

  1. Step 1: Understand code generation meaning

    Code generation means creating code automatically from instructions or examples.
  2. Step 2: Match purpose with options

    "Automatically create code from instructions" states this purpose; the other options describe different tasks.
  3. Final Answer:

    Automatically create code from instructions -> Option B
  4. Quick Check:

    Code generation = automatic code creation [OK]
Hint: Code generation means automatic code writing [OK]
Common Mistakes:
  • Confusing code generation with manual coding
  • Thinking code generation fixes errors automatically
  • Believing code generation deletes code
2. Which of the following is the correct Python syntax to define a function named generate_code?
easy
A. generate_code def():
B. function generate_code()
C. def generate_code[]:
D. def generate_code():

Solution

  1. Step 1: Recall Python function syntax

    Python functions start with def, followed by name and parentheses, then colon.
  2. Step 2: Check each option

    def generate_code(): matches correct syntax; A, B and C have syntax errors (A wrong order, B JavaScript style, C brackets).
  3. Final Answer:

    def generate_code(): -> Option D
  4. Quick Check:

    Python function = def name(): [OK]
Hint: Python functions start with def, then the name, parentheses, and a colon [OK]
Common Mistakes:
  • Using JavaScript function keyword in Python
  • Missing parentheses after function name
  • Using brackets instead of parentheses
3. What will be the output of this Python code generated by AI?
def add_numbers(a, b):
    return a + b

result = add_numbers(3, 4)
print(result)
medium
A. 7
B. 34
C. TypeError
D. None

Solution

  1. Step 1: Understand function behavior

    The function adds two numbers and returns the sum.
  2. Step 2: Calculate add_numbers(3, 4)

    3 + 4 equals 7, so result is 7 and printed.
  3. Final Answer:

    7 -> Option A
  4. Quick Check:

    3 + 4 = 7 [OK]
Hint: Adding numbers returns their sum [OK]
Common Mistakes:
  • Thinking + concatenates numbers as strings
  • Expecting error from simple addition
  • Confusing return value with print output
4. Identify the error in this AI-generated Python code:
def multiply(x, y):
return x * y

print(multiply(2, 3))
medium
A. Missing indentation for return statement
B. Wrong function name
C. Missing parentheses in print
D. Using * instead of + operator

Solution

  1. Step 1: Check Python indentation rules

    Python requires the return line inside function to be indented.
  2. Step 2: Identify error in code

    Return is not indented, causing IndentationError; other options are incorrect.
  3. Final Answer:

    Missing indentation for return statement -> Option A
  4. Quick Check:

    Python needs indented blocks [OK]
Hint: Indent inside functions in Python [OK]
Common Mistakes:
  • Ignoring indentation errors
  • Thinking print needs no parentheses in Python 3
  • Confusing operators without context
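
The error in question 4 can be confirmed without running the snippet, because Python rejects it at compile time. This is a small sketch; the `syntax_problem` helper is a hypothetical name introduced for illustration.

```python
broken = "def multiply(x, y):\nreturn x * y\n\nprint(multiply(2, 3))\n"
fixed = "def multiply(x, y):\n    return x * y\n\nprint(multiply(2, 3))\n"


def syntax_problem(source: str):
    """Return the name of the compile-time error class, or None if the source compiles."""
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError as err:  # IndentationError is a subclass of SyntaxError
        return type(err).__name__
    return None


print(syntax_problem(broken))  # IndentationError
print(syntax_problem(fixed))   # None
```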
5. You want to generate Python code that creates a dictionary from a list of keys ["a", "b", "c"] with values as their lengths. Which code snippet correctly uses dictionary comprehension?
hard
A. result = {len(k): k for k in ["a", "b", "c"]}
B. result = [k: len(k) for k in ["a", "b", "c"]]
C. result = {k: len(k) for k in ["a", "b", "c"]}
D. result = {k, len(k) for k in ["a", "b", "c"]}

Solution

  1. Step 1: Understand dictionary comprehension syntax

    It uses curly braces around a key: value expression followed by a for clause.
  2. Step 2: Check each option

    result = {k: len(k) for k in ["a", "b", "c"]} correctly creates dict with keys and their lengths; A swaps key and value; B uses list brackets wrongly; D uses comma instead of colon.
  3. Final Answer:

    result = {k: len(k) for k in ["a", "b", "c"]} -> Option C
  4. Quick Check:

    Dict comprehension = {key: value for item in iterable} [OK]
Hint: Dict comprehension uses {key: value for item in iterable} [OK]
Common Mistakes:
  • Using list brackets [] instead of {}
  • Swapping keys and values
  • Using comma instead of colon in dict
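
The four options in question 5 can be checked mechanically. In this sketch, options A and C are evaluated directly, while B and D are only compiled, since they are not valid Python; the `is_valid_syntax` helper is a hypothetical name introduced for illustration.

```python
keys = ["a", "b", "c"]

option_c = {k: len(k) for k in keys}  # key -> length, as the question asks
option_a = {len(k): k for k in keys}  # length -> key; every length is 1, so entries overwrite


def is_valid_syntax(source: str) -> bool:
    """Return True if the source compiles as Python."""
    try:
        compile(source, "<option>", "exec")
    except SyntaxError:
        return False
    return True


option_b_ok = is_valid_syntax('result = [k: len(k) for k in ["a", "b", "c"]]')
option_d_ok = is_valid_syntax('result = {k, len(k) for k in ["a", "b", "c"]}')

print(option_c)                  # {'a': 1, 'b': 1, 'c': 1}
print(option_a)                  # {1: 'c'}
print(option_b_ok, option_d_ok)  # False False
```

Option A is syntactically valid but answers a different question, and because all three keys have length 1 it keeps only the last one.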