We create evaluation datasets to check how well our language models or chains perform. They help us find mistakes and measure improvements.
Creating evaluation datasets in LangChain
Introduction
When you want to test if your language model answers questions correctly.
Before releasing a chatbot to make sure it understands users well.
To compare different models and pick the best one.
When you add new features and want to see if they work as expected.
To measure progress after training or fine-tuning a model.
Syntax
LangChain
from langchain.llms import OpenAI
from langchain.evaluation.qa import QAEvalChain
from langchain.schema import Document

# Create a list of evaluation examples
examples = [
    {"query": "What is LangChain?",
     "answer": "LangChain is a framework for building language model apps."},
    {"query": "Who created LangChain?",
     "answer": "Harrison Chase created LangChain."}
]

# Convert answers to Documents if your chain requires them
docs = [Document(page_content=ex["answer"]) for ex in examples]

# Initialize the grading LLM and the evaluation chain
llm = OpenAI(temperature=0)
eval_chain = QAEvalChain.from_llm(llm)

# Collect model predictions, one per example, as {"result": ...} dicts
predictions = [{"result": llm(ex["query"])} for ex in examples]

# Grade the predictions against the expected answers
results = eval_chain.evaluate(examples, predictions,
                              question_key="query",
                              prediction_key="result")
The examples list holds questions and expected answers.
Use Document to wrap text if your evaluation chain requires it.
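Before bringing in LangChain at all, it can help to see the data shapes involved. The sketch below is plain Python: the prediction dicts are hard-coded stand-ins for what a QA chain would return (the "result" key mirrors the chain's output format), so the pairing between examples and predictions is visible.

```python
# Evaluation examples: each dict pairs a query with its reference answer.
examples = [
    {"query": "What is LangChain?",
     "answer": "LangChain is a framework for building language model apps."},
    {"query": "Who created LangChain?",
     "answer": "Harrison Chase created LangChain."},
]

# Hypothetical model outputs, hard-coded here, in the shape a QA chain produces.
predictions = [
    {"result": "LangChain is a framework for building apps with language models."},
    {"result": "It was created by Harrison Chase."},
]

# Each example lines up with one prediction by index.
for example, prediction in zip(examples, predictions):
    print(example["query"], "->", prediction["result"])
```

Keeping the two lists parallel by index is what lets the evaluation chain grade each prediction against its matching expected answer.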
Examples
A simple example with one question and answer pair.
LangChain
examples = [
    {"query": "What is AI?", "answer": "AI means artificial intelligence."}
]
Wrap answers in Document objects for chains that need documents.
LangChain
docs = [Document(page_content=ex["answer"]) for ex in examples]
Create an evaluation chain and run it on your examples.
LangChain
eval_chain = QAEvalChain.from_llm(llm)
predictions = [{"result": llm(ex["query"])} for ex in examples]
results = eval_chain.evaluate(examples, predictions,
                              question_key="query",
                              prediction_key="result")
Sample Program
This program tests the language model's answers against the expected ones, then prints the evaluation results showing whether each answer matches.
LangChain
from langchain.llms import OpenAI
from langchain.evaluation.qa import QAEvalChain

# Initialize language model
llm = OpenAI(temperature=0)

# Prepare evaluation examples
examples = [
    {"query": "What is LangChain?",
     "answer": "LangChain is a framework for building language model apps."},
    {"query": "Who created LangChain?",
     "answer": "Harrison Chase created LangChain."}
]

# Generate one prediction per example for grading
predictions = [{"result": llm(ex["query"])} for ex in examples]

# Create the evaluation chain and grade predictions against expected answers
eval_chain = QAEvalChain.from_llm(llm)
results = eval_chain.evaluate(examples, predictions,
                              question_key="query",
                              prediction_key="result")
print(results)
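If you want a quick sanity check without calling an LLM grader at all, a simple word-overlap baseline can grade predictions. This is a hedged sketch, not part of LangChain; the grade function below is a hypothetical helper that marks a prediction CORRECT when it shares at least half of the reference answer's words.

```python
def grade(answer: str, prediction: str) -> str:
    """Mark a prediction CORRECT if it shares enough words with the reference answer."""
    ref_words = set(answer.lower().rstrip(".").split())
    pred_words = set(prediction.lower().rstrip(".").split())
    overlap = len(ref_words & pred_words) / len(ref_words)
    return "CORRECT" if overlap >= 0.5 else "INCORRECT"

examples = [{"query": "What is AI?", "answer": "AI means artificial intelligence."}]
predictions = [{"result": "AI stands for artificial intelligence."}]

results = [grade(ex["answer"], pred["result"])
           for ex, pred in zip(examples, predictions)]
print(results)  # → ['CORRECT']
```

An LLM-based grader like QAEvalChain handles paraphrases far better than word overlap, but a baseline like this is cheap, deterministic, and useful for smoke-testing your dataset format.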
Output
The evaluation chain returns one grade dictionary per example; with well-matched answers the entries typically look like {'results': 'CORRECT'}.
Important Notes
Make sure your language model (llm) is properly initialized before evaluation.
Evaluation datasets should have clear, correct answers to get meaningful results.
You can expand examples with more questions to test thoroughly.
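One way to act on the notes above is to validate the dataset before running an evaluation. The sketch below is plain Python; validate_examples is a hypothetical helper, not a LangChain API, that reports missing or empty fields.

```python
def validate_examples(examples):
    """Return a list of problems found in an evaluation dataset."""
    problems = []
    for i, ex in enumerate(examples):
        for key in ("query", "answer"):
            if key not in ex:
                problems.append(f"example {i}: missing key '{key}'")
            elif not str(ex[key]).strip():
                problems.append(f"example {i}: empty '{key}'")
    return problems

examples = [
    {"query": "What is AI?", "answer": "AI means artificial intelligence."},
    {"query": "", "answer": "Missing question."},
]
print(validate_examples(examples))  # → ["example 1: empty 'query'"]
```

Running a check like this first means a malformed example fails fast and cheaply, rather than mid-way through a paid LLM evaluation run.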
Summary
Evaluation datasets help check how well your language model answers questions.
Create examples with queries and expected answers to test your model.
Use LangChain's QAEvalChain to run evaluations easily.