Custom evaluation metrics help you measure how well your AI or language model is doing in ways that matter most to your project.
Custom evaluation metrics in LangChain
Introduction
Syntax
from langchain.evaluation import Evaluation

class MyMetric(Evaluation):
    def evaluate(self, prediction: str, reference: str) -> float:
        # Your custom logic here
        score = 0.0
        return score
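Create a class that inherits from Evaluation.
Implement the evaluate method to return a numeric score.
Check the import against your installed LangChain version: the documented base class for custom string metrics in langchain.evaluation is StringEvaluator, which expects an _evaluate_strings method that returns a dict such as {"score": 0.0}.
Examples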
from langchain.evaluation import Evaluation

class ExactMatch(Evaluation):
    def evaluate(self, prediction: str, reference: str) -> float:
        return 1.0 if prediction == reference else 0.0
from langchain.evaluation import Evaluation

class LengthDifference(Evaluation):
    def evaluate(self, prediction: str, reference: str) -> float:
        return 1.0 / (1 + abs(len(prediction) - len(reference)))
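To try the two example metrics without installing LangChain, the sketch below swaps in a hypothetical stand-in `Evaluation` base class. The stand-in is an assumption for illustration, not LangChain's own class, and the sample strings are made up.

```python
class Evaluation:
    """Hypothetical stand-in for the base class used in the examples (assumption)."""

    def evaluate(self, prediction: str, reference: str) -> float:
        raise NotImplementedError


class ExactMatch(Evaluation):
    def evaluate(self, prediction: str, reference: str) -> float:
        # 1.0 only when the two strings are identical (case-sensitive)
        return 1.0 if prediction == reference else 0.0


class LengthDifference(Evaluation):
    def evaluate(self, prediction: str, reference: str) -> float:
        # 1.0 for equal lengths, shrinking toward 0 as the lengths diverge
        return 1.0 / (1 + abs(len(prediction) - len(reference)))


exact_same = ExactMatch().evaluate("Paris", "Paris")
exact_diff = ExactMatch().evaluate("paris", "Paris")
length_score = LengthDifference().evaluate("abc", "abcde")
print(exact_same, exact_diff, length_score)
```

Note that ExactMatch is case-sensitive, so "paris" and "Paris" score 0.0; normalize both strings first if that is not what you want.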
Sample Program
This example defines a simple similarity metric: the number of words shared by the prediction and the reference, divided by the number of distinct words across both (Jaccard similarity). It then prints the score, which is 0.60 for the inputs shown.
from langchain.evaluation import Evaluation

class SimpleSimilarity(Evaluation):
    def evaluate(self, prediction: str, reference: str) -> float:
        pred_words = set(prediction.lower().split())
        ref_words = set(reference.lower().split())
        common = pred_words.intersection(ref_words)
        total = pred_words.union(ref_words)
        return len(common) / len(total) if total else 0.0

# Example usage
metric = SimpleSimilarity()
pred = "The quick brown fox"
ref = "The quick fox jumps"
score = metric.evaluate(pred, ref)
print(f"Similarity score: {score:.2f}")
Important Notes
Custom metrics should return a number, usually between 0 and 1, where higher means better.
Keep your metric logic simple and fast for better performance.
Test your metric with different inputs to make sure it behaves as expected.
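One way to act on these notes is a small table of inputs with known scores, including edge cases such as empty strings. The sketch below reuses the word-overlap logic from the sample program as a plain function; the function name `word_overlap` and the test cases are hypothetical.

```python
def word_overlap(prediction: str, reference: str) -> float:
    """Shared words divided by distinct words across both strings."""
    pred_words = set(prediction.lower().split())
    ref_words = set(reference.lower().split())
    total = pred_words | ref_words
    return len(pred_words & ref_words) / len(total) if total else 0.0


# Hypothetical test cases: (prediction, reference, expected score)
cases = [
    ("The quick brown fox", "The quick fox jumps", 0.6),  # 3 shared of 5 distinct
    ("same text", "same text", 1.0),                      # identical inputs
    ("apple", "banana", 0.0),                             # nothing in common
    ("", "", 0.0),                                        # empty inputs hit the fallback
]

results = []
for prediction, reference, expected in cases:
    score = word_overlap(prediction, reference)
    assert 0.0 <= score <= 1.0, "score should stay in [0, 1]"
    assert abs(score - expected) < 1e-9, (prediction, reference, score)
    results.append(score)

print(results)
```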
Summary
Custom evaluation metrics let you measure AI results in your own way.
Define a class inheriting from Evaluation and implement evaluate.
Use your metric to get scores that help improve your AI models.
Practice
1. What is the main purpose of creating a custom evaluation metric in LangChain?
easy
Solution
Step 1: Understand the role of evaluation metrics
Evaluation metrics measure how well an AI model performs its task.
Step 2: Identify why custom metrics are used
Custom metrics let you measure results in ways that standard metrics might not cover, fitting your unique needs.
Final Answer:
To measure AI results in a way that fits your specific needs -> Option B
Quick Check:
Custom metrics = tailored measurement [OK]
Hint: Custom metrics tailor scoring to your AI task [OK]
Common Mistakes:
- Thinking custom metrics speed training
- Believing they fix AI errors automatically
- Confusing metrics with model replacement
2. Which of the following is the correct way to start defining a custom evaluation metric class in LangChain?
easy
Solution
Step 1: Recall LangChain class inheritance syntax
Custom metrics inherit from the Evaluation base class using Python class syntax.
Step 2: Identify correct class definition
class MyMetric(Evaluation): correctly defines a class inheriting from Evaluation, matching LangChain patterns.
Final Answer:
class MyMetric(Evaluation): -> Option A
Quick Check:
Class inherits Evaluation = correct syntax [OK]
Hint: Use class inheritance with Evaluation base [OK]
Common Mistakes:
- Defining a function instead of a class
- Missing inheritance from Evaluation
- Using JavaScript syntax in Python
3. Given this custom metric class, what will metric.evaluate(['hello'], ['hello']) return?
class ExactMatch(Evaluation):
    def evaluate(self, predictions, references):
        return sum(p == r for p, r in zip(predictions, references)) / len(references)
medium
Solution
Step 1: Understand the evaluate method logic
It compares each prediction to its reference, counts the matches, then divides by the total number of references.
Step 2: Apply inputs to the method
With predictions=['hello'] and references=['hello'], the single pair matches, so the sum is 1 and the length is 1, giving 1/1 = 1.0.
Final Answer:
1.0 -> Option A
Quick Check:
Exact match count / total = 1.0 [OK]
Hint: Check if predictions equal references, then divide [OK]
Common Mistakes:
- Forgetting to divide by length
- Confusing sum with boolean values
- Expecting method to return a list
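The result can be checked by running the class directly. In this sketch the `Evaluation` base class is left out so the code runs standalone (an assumption; the method body is unchanged from the question), and a second, made-up input shows a partial match.

```python
class ExactMatch:
    # Base class omitted so the sketch runs without LangChain (assumption)
    def evaluate(self, predictions, references):
        # True counts as 1 and False as 0, so sum() yields the number of matches
        matches = sum(p == r for p, r in zip(predictions, references))
        return matches / len(references)


metric = ExactMatch()
single = metric.evaluate(["hello"], ["hello"])              # one pair, one match
mixed = metric.evaluate(["hello", "bye"], ["hello", "hi"])  # two pairs, one match
print(single, mixed)
```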
4. What is wrong with this custom metric class that causes an error?
class LengthDiff(Evaluation):
    def evaluate(self, predictions, references):
        return abs(len(predictions) - len(references)) / len(references)
medium
Solution
Step 1: Analyze the evaluate method with empty references
If references=[], then len(references)=0 and the division raises ZeroDivisionError.
Step 2: Identify the runtime error cause
The code divides by len(references) without checking whether references is empty, which causes the runtime error.
Final Answer:
It does not handle empty lists causing runtime error -> Option D
Quick Check:
len(references)==0 -> ZeroDivisionError [OK]
Hint: Check how method handles empty input lists [OK]
Common Mistakes:
- Assuming abs() causes syntax error
- Thinking evaluate method is missing
- Ignoring empty list edge cases
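A minimal sketch of one possible fix is to guard the division. Returning 0.0 for an empty reference list is an assumed fallback chosen for illustration; raising a ValueError would be an equally reasonable choice. The base class is omitted so the code runs standalone, and `unguarded` is a hypothetical helper that reproduces the original expression.

```python
class LengthDiff:
    """Metric from the question with an empty-input guard added."""

    def evaluate(self, predictions, references):
        if not references:
            # Assumed fallback: nothing to compare against, so report 0.0
            return 0.0
        return abs(len(predictions) - len(references)) / len(references)


def unguarded(predictions, references):
    # Original expression from the question, kept to show the failure
    return abs(len(predictions) - len(references)) / len(references)


try:
    unguarded(["a"], [])
    raised = False
except ZeroDivisionError:
    raised = True

metric = LengthDiff()
empty_score = metric.evaluate(["a"], [])                     # guarded path
normal_score = metric.evaluate(["a", "b", "c"], ["a", "b"])  # |3 - 2| / 2
print(raised, empty_score, normal_score)
```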
5. You want to create a custom metric that scores AI answers higher if they contain more keywords from a reference list. Which approach fits best?
hard
Solution
Step 1: Understand the goal of keyword-based scoring
The metric should reward predictions containing more keywords from the reference list.
Step 2: Identify the approach that measures keyword presence proportionally
Counting keywords in the prediction and dividing by the total number of keywords gives a score reflecting keyword coverage.
Final Answer:
Count how many keywords appear in the prediction, divide by total keywords -> Option C
Quick Check:
Keyword coverage scoring = Count how many keywords appear in the prediction, divide by total keywords [OK]
Hint: Score by keyword matches divided by total keywords [OK]
Common Mistakes:
- Using exact match instead of keyword count
- Measuring length difference unrelated to keywords
- Returning fixed scores ignoring content
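The keyword-coverage approach could look roughly like the sketch below. The function name `keyword_coverage`, the whitespace tokenization, and the 0.0 fallback for an empty keyword list are all assumptions for illustration, not part of LangChain.

```python
def keyword_coverage(prediction: str, keywords: list[str]) -> float:
    """Fraction of distinct keywords that appear as words in the prediction."""
    unique_keywords = {kw.lower() for kw in keywords}
    if not unique_keywords:
        return 0.0  # assumed fallback when there is nothing to look for
    # Simple whitespace tokenization; punctuation is not stripped
    words = set(prediction.lower().split())
    hits = len(unique_keywords & words)
    return hits / len(unique_keywords)


partial = keyword_coverage(
    "LangChain supports custom metrics",
    ["langchain", "metrics", "agents"],
)
full = keyword_coverage("fast and simple", ["fast", "simple"])
none_given = keyword_coverage("any text", [])
print(partial, full, none_given)
```

Because the split is on whitespace only, a keyword followed by punctuation (for example "metrics.") would not count as a match; strip punctuation first if your answers contain it.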
