Evaluation datasets let us check how well a language model or chain performs, so we can find its mistakes and fix them.
Creating evaluation datasets in LangChain
Introduction
Syntax
LangChain
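from langchain.evaluation.qa import QAEvalChain
from langchain.schema import Document

# Create a list of evaluation examples (question plus reference answer)
examples = [
    {"query": "What is LangChain?", "answer": "LangChain is a framework for building language model apps."},
    {"query": "Who created LangChain?", "answer": "Harrison Chase created LangChain."},
]

# Convert examples to Documents if needed
docs = [Document(page_content=ex["answer"]) for ex in examples]

# Collect the model's answers; `llm` must already be initialized,
# and this assumes llm.invoke() returns a plain string
predictions = [{"result": llm.invoke(ex["query"])} for ex in examples]

# Initialize evaluation chain
eval_chain = QAEvalChain.from_llm(llm)

# Run evaluation: grade each prediction against its reference answer
results = eval_chain.evaluate(examples, predictions)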
The examples list holds questions and expected answers.
Use Document to wrap text if your evaluation chain requires it.
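Because every example must carry both a question and an expected answer, it can help to check the list before running any evaluation. The following is a minimal sketch in plain Python; `validate_examples` is a hypothetical helper written for this page, not part of LangChain, and it assumes the `query` and `answer` key names used above.

```python
def validate_examples(examples, question_key="query", answer_key="answer"):
    """Return a list of problems; an empty list means the dataset looks usable."""
    problems = []
    for index, example in enumerate(examples):
        # Each example must be a dictionary, not a tuple, list, or string.
        if not isinstance(example, dict):
            problems.append(f"example {index}: expected a dict, got {type(example).__name__}")
            continue
        # Both the question and the reference answer must be non-empty strings.
        for key in (question_key, answer_key):
            value = example.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"example {index}: missing or empty '{key}'")
    return problems


examples = [
    {"query": "What is LangChain?", "answer": "LangChain is a framework for building language model apps."},
    {"query": "Who created LangChain?", "answer": ""},  # deliberately broken
]
problems = validate_examples(examples)
print(problems)
```

Running a check like this first means a malformed example is reported by position instead of surfacing later as a confusing failure inside the evaluation chain.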
Examples
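LangChain
examples = [
    {"query": "What is AI?", "answer": "AI means artificial intelligence."}
]
LangChain
docs = [Document(page_content=ex["answer"]) for ex in examples]
LangChain
# `llm` is an initialized language model; this assumes llm.invoke() returns a string
predictions = [{"result": llm.invoke(ex["query"])} for ex in examples]
eval_chain = QAEvalChain.from_llm(llm)
results = eval_chain.evaluate(examples, predictions)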
Sample Program
This program grades the language model's answers against the expected ones and prints the evaluation results, one entry per example.
LangChain
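from langchain.llms import OpenAI
from langchain.evaluation.qa import QAEvalChain

# Initialize language model (temperature=0 keeps answers as repeatable as possible)
llm = OpenAI(temperature=0)

# Prepare evaluation examples
examples = [
    {"query": "What is LangChain?", "answer": "LangChain is a framework for building language model apps."},
    {"query": "Who created LangChain?", "answer": "Harrison Chase created LangChain."},
]

# Ask the model each question; evaluate() reads the answer from the "result" key
predictions = [{"result": llm.invoke(ex["query"])} for ex in examples]

# Create evaluation chain
eval_chain = QAEvalChain.from_llm(llm)

# Run evaluation: grade each prediction against its reference answer
results = eval_chain.evaluate(examples, predictions)
print(results)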
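To see the shape of the data without calling a model, the sketch below grades hand-written predictions with a simple normalized exact match. It is a stand-in for illustration only: `normalize` and `exact_match_grade` are hypothetical helpers, not LangChain functions, and the `result` and `results` key names are assumptions chosen to resemble what an evaluation chain works with. An LLM-based grader is more lenient about wording than this comparison.

```python
import string


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def exact_match_grade(examples, predictions, answer_key="answer", prediction_key="result"):
    """Pair each example with its prediction and grade by normalized equality."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    grades = []
    for example, prediction in zip(examples, predictions):
        matched = normalize(example[answer_key]) == normalize(prediction[prediction_key])
        grades.append({"results": "CORRECT" if matched else "INCORRECT"})
    return grades


examples = [
    {"query": "Who created LangChain?", "answer": "Harrison Chase created LangChain."},
    {"query": "What is LangChain?", "answer": "LangChain is a framework for building language model apps."},
]
# Hand-written stand-ins for model output
predictions = [
    {"result": "harrison chase created langchain"},
    {"result": "It is a Python library."},
]
grades = exact_match_grade(examples, predictions)
print(grades)
```

The second prediction is arguably acceptable but fails the exact match, which is the kind of case where a model-based grader is the better choice.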
Important Notes
Make sure your language model (llm) is properly initialized before evaluation.
Evaluation datasets should have clear, correct answers to get meaningful results.
You can expand examples with more questions to test thoroughly.
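As the dataset grows, keeping the examples in a file is easier than editing a Python list. The sketch below assumes a JSON Lines layout (one JSON object per line) with `query` and `answer` fields; `load_examples` is a hypothetical helper, and the duplicate check on the question text is one possible design choice.

```python
import json


def load_examples(jsonl_text, question_key="query", answer_key="answer"):
    """Parse JSON Lines text into example dicts, skipping repeated questions."""
    examples = []
    seen_questions = set()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # Compare questions case-insensitively and ignore extra spaces.
        fingerprint = " ".join(record[question_key].lower().split())
        if fingerprint in seen_questions:
            continue
        seen_questions.add(fingerprint)
        examples.append({question_key: record[question_key], answer_key: record[answer_key]})
    return examples


raw_jsonl = """
{"query": "What is LangChain?", "answer": "LangChain is a framework for building language model apps."}
{"query": "Who created LangChain?", "answer": "Harrison Chase created LangChain."}
{"query": "what is  langchain?", "answer": "A duplicate question with different spacing and case."}
"""
examples = load_examples(raw_jsonl)
print(len(examples))
```

In practice the text would come from a file, for example via `open(path).read()`, where `path` is whatever location you choose for the dataset.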
Summary
Evaluation datasets help check how well your language model answers questions.
Create examples with queries and expected answers to test your model.
Use LangChain's QAEvalChain to run evaluations easily.
Practice
1. What is the main purpose of creating evaluation datasets in LangChain?
easy
Solution
Step 1: Understand evaluation datasets
Evaluation datasets contain example questions and expected answers to check model accuracy.
Step 2: Identify the purpose in LangChain context
They are used to test how well the model answers, not for training or storage.
Final Answer:
To test how well the language model answers specific questions -> Option C
Quick Check:
Evaluation datasets = test model accuracy [OK]
Hint: Evaluation datasets check the model's answers; they do not train it [OK]
Common Mistakes:
- Confusing evaluation datasets with training data
- Thinking evaluation datasets speed up the model
- Assuming evaluation datasets store user data
2. Which of the following is the correct way to create an evaluation example in LangChain?
easy
Solution
Step 1: Recall LangChain evaluation example format
Evaluation examples are dictionaries that pair a question with its reference answer, here under the keys 'query' and 'expected_answer'. QAEvalChain.evaluate() reads the reference from 'answer' by default, so pass answer_key="expected_answer" when you use this key name.
Step 2: Match the correct syntax
example = {"query": "What is AI?", "expected_answer": "Artificial Intelligence"} uses a dictionary with proper keys; the other options incorrectly use tuples, lists, or strings.
Final Answer:
example = {"query": "What is AI?", "expected_answer": "Artificial Intelligence"} -> Option D
Quick Check:
Evaluation example = dictionary with keys [OK]
Hint: Use a dictionary with 'query' and 'expected_answer' keys [OK]
Common Mistakes:
- Using tuples or lists instead of dictionaries
- Not using correct keys 'query' and 'expected_answer'
- Using plain strings without structure
3. Given the following code snippet, what will be the output?
from langchain.evaluation.qa import QAEvalChain
examples = [{"query": "Capital of France?", "expected_answer": "Paris"}]
chain = QAEvalChain.from_llm(llm=None)
results = chain.evaluate(examples)
print(results)
medium
Solution
Step 1: Analyze the QAEvalChain initialization
The method from_llm requires a valid language model instance, not None.
Step 2: Predict the error from invalid llm argument
Passing None fails when the chain is constructed, before any example is graded, because the chain cannot run without a model. The quiz treats this as a TypeError; the exact exception class may differ between LangChain versions.
Final Answer:
TypeError because llm=None is invalid -> Option A
Quick Check:
Invalid llm argument = TypeError [OK]
Hint: QAEvalChain needs a valid LLM; None causes an error [OK]
Common Mistakes:
- Assuming None is a valid LLM
- Expecting output without running the model
- Ignoring required imports or parameters
4. You wrote this code to create evaluation examples but get an error:
examples = [{"query": "Who wrote Hamlet?", "answer": "Shakespeare"}]
chain = QAEvalChain.from_llm(llm=some_llm)
results = chain.evaluate(examples)
print(results)
What is the likely cause of the error?
medium
Solution
Step 1: Check example dictionary keys
The example stores the reference answer under 'answer', while the chain in this question is expected to read it from 'expected_answer'. In QAEvalChain, the key name is set by the answer_key argument of evaluate(), which defaults to "answer".
Step 2: Identify mismatch causing error
When the key in the example does not match the key the chain reads, the chain cannot find the expected answer and fails.
Final Answer:
The key 'answer' should be 'expected_answer' in the example dictionary -> Option B
Quick Check:
Correct key name = 'expected_answer' [OK]
Hint: Use the 'expected_answer' key, not 'answer', in examples [OK]
Common Mistakes:
- Using wrong key names in example dictionaries
- Assuming method names without checking docs
- Ignoring variable definitions
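A key mismatch like the one in this question can also be fixed in the data instead of in the chain's configuration. The sketch below renames a key across a whole dataset; `rename_key` is a hypothetical helper, not a LangChain function, and the key names are the ones used in this quiz.

```python
def rename_key(examples, old_key, new_key):
    """Return copies of the examples with old_key renamed to new_key."""
    renamed = []
    for example in examples:
        copy = dict(example)  # leave the caller's dictionaries untouched
        if old_key in copy:
            if new_key in copy:
                raise ValueError(f"example already has a '{new_key}' key: {example!r}")
            copy[new_key] = copy.pop(old_key)
        renamed.append(copy)
    return renamed


examples = [{"query": "Who wrote Hamlet?", "answer": "Shakespeare"}]
fixed = rename_key(examples, old_key="answer", new_key="expected_answer")
print(fixed)
```

Either direction works; what matters is that the key in every example matches the key the evaluation step is set up to read.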
5. You want to create an evaluation dataset with multiple examples and run QAEvalChain to check model accuracy. Which approach correctly prepares and evaluates the dataset?
hard
Solution
Step 1: Format evaluation dataset correctly
The dataset is a list of dictionaries, each holding a 'query' and an 'expected_answer' (the key this quiz uses for the reference answer).
Step 2: Use the correct method to evaluate
QAEvalChain's evaluate() method grades a whole list of examples, paired with the model's predictions, in one call.
Final Answer:
Prepare a list of dictionaries with 'query' and 'expected_answer', then call chain.evaluate(examples) -> Option A
Quick Check:
List of dicts + evaluate() = correct approach [OK]
Hint: Use a list of dicts with the evaluate() method for multiple examples [OK]
Common Mistakes:
- Using tuples or dicts with wrong structure
- Calling run() instead of evaluate() for batch evaluation
- Passing strings instead of structured data
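Once a batch has been graded, the per-example verdicts are usually summarized as a single accuracy figure. The sketch below assumes each graded row is a dictionary whose `results` field holds a verdict such as "CORRECT" or "INCORRECT"; that field name and wording are assumptions that may differ between LangChain versions, and `accuracy` is a hypothetical helper.

```python
def accuracy(graded, grade_key="results"):
    """Fraction of graded rows whose verdict starts with CORRECT."""
    if not graded:
        return 0.0
    correct = sum(
        1
        for row in graded
        # "INCORRECT" does not start with "CORRECT", so it is not counted.
        if row[grade_key].strip().upper().startswith("CORRECT")
    )
    return correct / len(graded)


graded = [
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": " correct"},  # graders sometimes add whitespace or change case
    {"results": "CORRECT"},
]
score = accuracy(graded)
print(f"accuracy: {score:.2f}")
```

Tracking this number across prompt or model changes shows whether a change helped or hurt on the same set of questions.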
