
Creating evaluation datasets in LangChain

Introduction

Evaluation datasets let us check how well our language models or chains perform: they surface mistakes and give us a baseline for improvement. Typical situations where you need one:

When you want to test if your language model answers questions correctly.
Before releasing a chatbot to make sure it understands users well.
To compare different models and pick the best one.
When you add new features and want to see if they work as expected.
To measure progress after training or fine-tuning a model.
Syntax
LangChain
from langchain.llms import OpenAI
from langchain.evaluation.qa import QAEvalChain
from langchain.schema import Document

# Initialize the language model used for grading
llm = OpenAI(temperature=0)

# Create a list of evaluation examples
examples = [
    {"query": "What is LangChain?", "answer": "LangChain is a framework for building language model apps."},
    {"query": "Who created LangChain?", "answer": "Harrison Chase created LangChain."}
]

# Convert examples to Documents if needed
docs = [Document(page_content=ex["answer"]) for ex in examples]

# Collect the model's actual output for each query
predictions = [{"result": llm(ex["query"])} for ex in examples]

# Initialize evaluation chain
eval_chain = QAEvalChain.from_llm(llm)

# Run evaluation: grades predictions against the expected answers
results = eval_chain.evaluate(examples, predictions)

The examples list holds questions and expected answers.

Use Document to wrap text if your evaluation chain requires it.
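Before grading, the evaluation chain expects a `predictions` list that lines up one-to-one with `examples`. A plain-Python sketch of the shapes involved (the key names `query`, `answer`, and `result` are `QAEvalChain`'s defaults; the prediction text here is an assumed model output):

```python
# Expected answers: what the model *should* say
examples = [
    {"query": "What is LangChain?",
     "answer": "LangChain is a framework for building language model apps."},
]

# Model outputs: what the model *did* say, one dict per example,
# keyed by "result" (QAEvalChain's default prediction_key)
predictions = [
    {"result": "LangChain is a framework for developing LLM applications."},
]

# The two lists must line up index by index
assert len(examples) == len(predictions)
for ex, pred in zip(examples, predictions):
    print(ex["query"], "->", pred["result"])
```

Keeping the two lists parallel is what lets the evaluation chain pair each expected answer with the matching model output.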

Examples
A simple example with one question and answer pair.
LangChain
examples = [
    {"query": "What is AI?", "answer": "AI means artificial intelligence."}
]
Wrap answers in Document objects for chains that need documents.
LangChain
docs = [Document(page_content=ex["answer"]) for ex in examples]
Create an evaluation chain and run it on your examples together with the model's predictions.
LangChain
# predictions: the model's outputs, e.g. [{"result": "..."}] per example
eval_chain = QAEvalChain.from_llm(llm)
results = eval_chain.evaluate(examples, predictions)
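In the legacy API, `evaluate` returns one grade dict per example with a text verdict under the `"results"` key. A sketch of summarizing such output; `mock_results` stands in for a real run:

```python
# Mocked grader output standing in for eval_chain.evaluate(...);
# the legacy QAEvalChain returns a text grade per example
mock_results = [
    {"results": "CORRECT"},
    {"results": "INCORRECT"},
    {"results": "CORRECT"},
]

def pass_rate(results):
    """Fraction of examples graded CORRECT."""
    graded = [r["results"].strip().upper().startswith("CORRECT") for r in results]
    return sum(graded) / len(graded)

print(f"pass rate: {pass_rate(mock_results):.0%}")  # 2 of 3 correct
```

Tracking this pass rate across model versions gives you a simple regression signal.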
Sample Program

This program tests the language model's answers against the expected ones and prints the evaluation results, showing how well each answer matches.

LangChain
from langchain.llms import OpenAI
from langchain.evaluation.qa import QAEvalChain

# Initialize language model (used both to answer and to grade)
llm = OpenAI(temperature=0)

# Prepare evaluation examples
examples = [
    {"query": "What is LangChain?", "answer": "LangChain is a framework for building language model apps."},
    {"query": "Who created LangChain?", "answer": "Harrison Chase created LangChain."}
]

# Generate the model's prediction for each query
predictions = [{"result": llm(ex["query"])} for ex in examples]

# Create evaluation chain
eval_chain = QAEvalChain.from_llm(llm)

# Grade predictions against the expected answers
results = eval_chain.evaluate(examples, predictions)

print(results)
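To sanity-check the pipeline without any API calls, you can swap in a naive word-overlap grader. This is purely illustrative (QAEvalChain instead asks an LLM to judge each pair), and the 0.5 threshold is an arbitrary assumption:

```python
def naive_grade(expected: str, predicted: str, threshold: float = 0.5) -> str:
    """Grade by word overlap -- a rough stand-in for LLM-based grading."""
    exp_words = set(expected.lower().split())
    pred_words = set(predicted.lower().split())
    overlap = len(exp_words & pred_words) / max(len(exp_words), 1)
    return "CORRECT" if overlap >= threshold else "INCORRECT"

examples = [{"query": "What is AI?", "answer": "AI means artificial intelligence."}]
predictions = [{"result": "AI means artificial intelligence."}]

results = [
    {"results": naive_grade(ex["answer"], pred["result"])}
    for ex, pred in zip(examples, predictions)
]
print(results)
```

String overlap misses paraphrases and synonyms, which is exactly why LLM-based grading is the default here; this stub is only for offline plumbing tests.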
Important Notes

Make sure your language model (llm) is properly initialized before evaluation.

Evaluation datasets should have clear, correct answers to get meaningful results.

You can expand examples with more questions to test thoroughly.
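One low-effort way to expand the dataset is to keep a flat list of (question, expected answer) pairs and convert them to the dict format shown above. A small sketch (the pairs themselves are assumed sample data):

```python
# (question, expected answer) pairs -- extend this list as coverage grows
qa_pairs = [
    ("What is LangChain?", "LangChain is a framework for building language model apps."),
    ("Who created LangChain?", "Harrison Chase created LangChain."),
    ("What is AI?", "AI means artificial intelligence."),
]

# Convert to the {"query": ..., "answer": ...} format the evaluation expects
examples = [{"query": q, "answer": a} for q, a in qa_pairs]
print(f"{len(examples)} evaluation examples")
```

Keeping the raw pairs separate from the dict format makes it easy to load them from a CSV or spreadsheet later.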

Summary

Evaluation datasets help check how well your language model answers questions.

Create examples with queries and expected answers to test your model.

Use LangChain's QAEvalChain to run evaluations easily.