Custom evaluation metrics in LangChain - Practice Problems & Coding Challenges

Challenge - 5 Problems
Component behavior
intermediate
What output does this custom metric function produce?

Consider this Python function used as a custom evaluation metric in LangChain:

def custom_metric(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    total = len(references)
    return correct / total if total > 0 else 0

What is the output of custom_metric(['a', 'b', 'c'], ['a', 'x', 'c'])?

A. 0
B. 0.3333333333333333
C. 0.6666666666666666
D. 1.0
💡 Hint

Count how many predictions match the references exactly, then divide by total items.

📝 Syntax
intermediate
Which option correctly defines a custom metric function in LangChain?

Which of the following Python functions correctly defines a custom evaluation metric that returns the ratio of matching items between predictions and references?

A.
def metric(predictions, references):
    return sum(p == r for p, r in zip(predictions, references)) * len(references)
B.
def metric(predictions, references):
    return sum(p == r for p in predictions for r in references) / len(references)
C.
def metric(predictions, references):
    return sum(predictions == references) / len(references)
D.
def metric(predictions, references):
    return sum(p == r for p, r in zip(predictions, references)) / len(references)
💡 Hint

Use zip to pair predictions and references correctly.

🔧 Debug
advanced
What error does this custom metric code raise?

Given this custom metric function:

def metric(predictions, references):
    return sum(p == r for p, r in zip(predictions, references)) / len(predictions)

What error will occur if predictions is an empty list and references is non-empty?

A. ZeroDivisionError
B. IndexError
C. TypeError
D. No error, returns 0
💡 Hint

Check what happens when dividing by the length of an empty list.
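
One way to guard against the case the hint points to is to check the denominator before dividing. The sketch below is a minimal illustration only: the name safe_metric and the choice to return 0.0 for an empty prediction list are assumptions, not part of the original question.

```python
def safe_metric(predictions, references):
    """Exact-match ratio over predictions, guarded against empty input."""
    total = len(predictions)
    if total == 0:
        # Assumption: an empty prediction list scores 0.0 instead of raising.
        return 0.0
    # zip pairs items position by position; True counts as 1 when summed.
    matches = sum(p == r for p, r in zip(predictions, references))
    return matches / total


print(safe_metric([], ["a", "b"]))
print(safe_metric(["a", "b"], ["a", "x"]))
```

Whether an empty input should score 0.0, return None, or raise a clearer error depends on how the score is consumed downstream.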

🧠 Conceptual
advanced
Why use custom evaluation metrics in LangChain?

Which reason best explains why you might create a custom evaluation metric instead of using built-in ones in LangChain?

A. To measure specific qualities of your model's output that built-in metrics don't capture
B. Because built-in metrics are always inaccurate and unreliable
C. Because LangChain requires custom metrics for all models
D. To make the evaluation run faster by avoiding built-in functions
💡 Hint

Think about why general metrics might not fit every use case.

State output
expert
What is the final value of score after running this custom metric?

Consider this code snippet used in LangChain to evaluate predictions:

class CustomMetric:
    def __init__(self):
        self.correct = 0
        self.total = 0
    def update(self, predictions, references):
        for p, r in zip(predictions, references):
            if p == r:
                self.correct += 1
            self.total += 1
    def compute(self):
        return self.correct / self.total if self.total > 0 else 0

metric = CustomMetric()
metric.update(['a', 'b'], ['a', 'x'])
metric.update(['c'], ['c'])
score = metric.compute()

What is the value of score?

A. 0.5
B. 0.6666666666666666
C. 1.0
D. 0.0
💡 Hint

Count total matches and total items after both updates.
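
A stateful metric of this kind accumulates counts across calls, so feeding it several batches gives the same score as one pass over all the data. The sketch below illustrates that with different inputs from the question; the class name RunningExactMatch is hypothetical.

```python
class RunningExactMatch:
    """Accumulates exact-match counts across several update() calls."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, predictions, references):
        for p, r in zip(predictions, references):
            self.correct += int(p == r)  # add 1 for a match, 0 otherwise
            self.total += 1

    def compute(self):
        return self.correct / self.total if self.total > 0 else 0.0


# Two batches fed separately...
batched = RunningExactMatch()
batched.update(["x", "y", "z"], ["x", "y", "q"])
batched.update(["w"], ["v"])

# ...versus the same items fed in one call.
single = RunningExactMatch()
single.update(["x", "y", "z", "w"], ["x", "y", "q", "v"])

print(batched.compute() == single.compute())  # True
```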

Practice

1. What is the main purpose of creating a custom evaluation metric in LangChain?
easy
A. To speed up the AI model training process
B. To measure AI results in a way that fits your specific needs
C. To automatically fix errors in AI outputs
D. To replace the AI model with a simpler one

Solution

  1. Step 1: Understand the role of evaluation metrics

    Evaluation metrics measure how well an AI model performs its task.
  2. Step 2: Identify why custom metrics are used

    Custom metrics let you measure results in ways that standard metrics might not cover, fitting your unique needs.
  3. Final Answer:

    To measure AI results in a way that fits your specific needs -> Option B
  4. Quick Check:

    Custom metrics = tailored measurement [OK]
Hint: Custom metrics tailor scoring to your AI task [OK]
Common Mistakes:
  • Thinking custom metrics speed training
  • Believing they fix AI errors automatically
  • Confusing metrics with model replacement
2. Which of the following is the correct way to start defining a custom evaluation metric class in LangChain?
easy
A. class MyMetric(Evaluation):
B. def MyMetric():
C. class MyMetric():
D. function MyMetric extends Evaluation {}

Solution

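  1. Step 1: Recall Python class inheritance syntax

    In this exercise, custom metrics inherit from a base class named Evaluation, written with Python class syntax. (In the LangChain library itself, the comparable base class for string metrics is StringEvaluator in langchain.evaluation.)
  2. Step 2: Identify correct class definition

    class MyMetric(Evaluation): defines a class inheriting from Evaluation. Option B defines a function, Option C omits the base class, and Option D is not Python syntax.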
  3. Final Answer:

    class MyMetric(Evaluation): -> Option A
  4. Quick Check:

    Class inherits Evaluation = correct syntax [OK]
Hint: Use class inheritance with Evaluation base [OK]
Common Mistakes:
  • Defining a function instead of a class
  • Missing inheritance from Evaluation
  • Using JavaScript syntax in Python
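
The inheritance pattern from this solution can be sketched as follows. Note that Evaluation here is a hypothetical stand-in defined locally so the example runs on its own; it is not imported from LangChain, and its evaluate signature is an assumption taken from the questions on this page.

```python
from abc import ABC, abstractmethod


class Evaluation(ABC):
    """Hypothetical stand-in for the base class named in the question."""

    @abstractmethod
    def evaluate(self, predictions, references):
        """Return a numeric score for the predictions."""


class MyMetric(Evaluation):
    """Exact-match ratio implemented as a subclass."""

    def evaluate(self, predictions, references):
        if not references:
            return 0.0  # assumption: empty references score 0.0
        matches = sum(p == r for p, r in zip(predictions, references))
        return matches / len(references)


metric = MyMetric()
print(metric.evaluate(["a", "b"], ["a", "c"]))  # 0.5
```

Making the base class abstract means a subclass that forgets to implement evaluate cannot be instantiated, which surfaces the mistake early.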
3. Given this custom metric class, what will metric.evaluate(['hello'], ['hello']) return?
class ExactMatch(Evaluation):
    def evaluate(self, predictions, references):
        return sum(p == r for p, r in zip(predictions, references)) / len(references)
medium
A. 1.0
B. 0.0
C. Error due to missing method
D. None

Solution

  1. Step 1: Understand the evaluate method logic

    It compares each prediction to the reference and counts matches, then divides by total references.
  2. Step 2: Apply inputs to the method

    With predictions=['hello'] and references=['hello'], the single pair matches, so sum is 1 and length is 1, result is 1/1 = 1.0.
  3. Final Answer:

    1.0 -> Option A
  4. Quick Check:

    Exact match count / total = 1.0 [OK]
Hint: Check if predictions equal references, then divide [OK]
Common Mistakes:
  • Forgetting to divide by length
  • Confusing sum with boolean values
  • Expecting method to return a list
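
The step that often causes confusion is summing boolean values. In Python, True counts as 1 and False as 0, so summing the comparison results gives the number of matches. A small illustration with made-up inputs:

```python
predictions = ["hello", "world", "foo"]
references = ["hello", "there", "foo"]

# Each comparison yields a bool; collect them to see the intermediate values.
flags = [p == r for p, r in zip(predictions, references)]
print(flags)       # [True, False, True]
print(sum(flags))  # 2

# Dividing the match count by the number of references gives the ratio.
score = sum(flags) / len(references)
print(score)
```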
4. What is wrong with this custom metric class that causes an error?
class LengthDiff(Evaluation):
    def evaluate(self, predictions, references):
        return abs(len(predictions) - len(references)) / len(references)
medium
A. It returns a number instead of a score between 0 and 1
B. It does not implement the evaluate method
C. It uses abs() incorrectly causing a syntax error
D. It does not handle empty lists causing runtime error

Solution

  1. Step 1: Analyze the evaluate method with empty references

    If references=[], len(references)=0 causes ZeroDivisionError in the division.
  2. Step 2: Identify the runtime error cause

    The code divides by len(references) without checking if references is empty, causing runtime error.
  3. Final Answer:

    It does not handle empty lists causing runtime error -> Option D
  4. Quick Check:

    len(references)==0 -> ZeroDivisionError [OK]
Hint: Check how method handles empty input lists [OK]
Common Mistakes:
  • Assuming abs() causes syntax error
  • Thinking evaluate method is missing
  • Ignoring empty list edge cases
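
A minimal sketch of one possible fix, written as a plain function so it runs without any base class. The name length_diff and the decision to raise ValueError for empty references are assumptions; returning a default score would be an equally valid choice.

```python
def length_diff(predictions, references):
    """Relative difference in list lengths, with an explicit empty-input check."""
    if not references:
        # Assumption: fail with a clear message rather than a ZeroDivisionError.
        raise ValueError("references must not be empty")
    return abs(len(predictions) - len(references)) / len(references)


print(length_diff(["a"], ["a", "b"]))  # 0.5
```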
5. You want to create a custom metric that scores AI answers higher if they contain more keywords from a reference list. Which approach fits best?
hard
A. Calculate the difference in length between prediction and reference
B. Check if prediction exactly matches the reference string
C. Count how many keywords appear in the prediction, divide by total keywords
D. Return a fixed score regardless of prediction content

Solution

  1. Step 1: Understand the goal of keyword-based scoring

    The metric should reward predictions containing more keywords from the reference list.
  2. Step 2: Identify the approach that measures keyword presence proportionally

    Counting keywords in prediction and dividing by total keywords gives a score reflecting keyword coverage.
  3. Final Answer:

    Count how many keywords appear in the prediction, divide by total keywords -> Option C
  4. Quick Check:

    Keyword coverage = matched keywords / total keywords [OK]
Hint: Score by keyword matches divided by total keywords [OK]
Common Mistakes:
  • Using exact match instead of keyword count
  • Measuring length difference unrelated to keywords
  • Returning fixed scores ignoring content
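
The keyword-coverage approach from Option C might look like the sketch below. The function name keyword_coverage, the case-insensitive substring matching, and the 0.0 score for an empty keyword list are all assumptions made for illustration.

```python
def keyword_coverage(prediction, keywords):
    """Fraction of distinct keywords that appear in the prediction text."""
    unique = {k.lower() for k in keywords}  # ignore case and duplicates
    if not unique:
        return 0.0  # assumption: no keywords means nothing to cover
    text = prediction.lower()
    # Substring check; a stricter metric could match on whole words instead.
    found = sum(1 for k in unique if k in text)
    return found / len(unique)


score = keyword_coverage(
    "LangChain supports custom evaluation metrics",
    ["langchain", "metrics", "dataset", "evaluation"],
)
print(score)  # 0.75
```

Substring matching is simple but will also count a keyword that appears inside a longer word, so tokenizing the prediction first may be preferable for short keywords.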