Automated evaluation pipelines check how well your language models or chains perform without manual review. They save time and make your results more reliable and repeatable.
Automated evaluation pipelines in LangChain
```python
from langchain.evaluation.qa import QAEvalChain

# Create an evaluation chain from a grading LLM
evaluation_chain = QAEvalChain.from_llm(llm=your_llm)

# Grade model predictions against reference answers.
# `examples` holds the questions and reference answers;
# `predictions` holds your model's outputs under the "result" key.
results = evaluation_chain.evaluate(
    examples,
    predictions,
    question_key="query",
    answer_key="answer",
    prediction_key="result",
)
```
Note that QAEvalChain does not run your model for you: you generate the predictions first, and the chain then uses an LLM to grade each prediction against its reference answer.
You provide examples (each with a question and its expected answer) together with the model's outputs, and get back a grade for each pair.
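Concretely, the predictions list can be built by mapping your model over the questions. A minimal sketch in plain Python, where `run_model` is a hypothetical stand-in for your actual chain or LLM call:

```python
# Hypothetical stand-in for calling your model or chain;
# in practice this would be something like chain.run(question).
def run_model(question: str) -> str:
    canned = {"What is 2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "unknown")

examples = [
    {"query": "What is 2+2?", "answer": "4"},
    {"query": "Capital of France?", "answer": "Paris"},
]

# Build the predictions list that evaluate() expects:
# one dict per example, with the model output under "result".
predictions = [{"result": run_model(ex["query"])} for ex in examples]
```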
```python
from langchain.evaluation.qa import QAEvalChain

# Simple evaluation chain setup
evaluation_chain = QAEvalChain.from_llm(llm=my_llm)

examples = [
    {"query": "What is 2+2?", "answer": "4"},
    {"query": "Capital of France?", "answer": "Paris"},
]
# The model's outputs, keyed by "result" (the default prediction_key)
predictions = [
    {"result": "4"},
    {"result": "Paris"},
]

results = evaluation_chain.evaluate(examples, predictions)
```
```python
from langchain.evaluation.qa import QAEvalChain

# The key names are configurable if your data uses different field names
custom_eval = QAEvalChain.from_llm(llm=my_llm)
results = custom_eval.evaluate(
    examples,
    predictions,
    question_key="input_text",
    answer_key="expected_text",
    prediction_key="generated_text",
)
```
This program sets up a simple evaluation pipeline that checks if the model's answers match the correct answers for a few questions.
```python
from langchain.chat_models import ChatOpenAI
from langchain.evaluation.qa import QAEvalChain

# Initialize a deterministic grading model (gpt-4 is a chat model,
# so it goes through ChatOpenAI rather than the completion-style OpenAI class)
llm = ChatOpenAI(model_name="gpt-4", temperature=0)

# Create evaluation chain
evaluation_chain = QAEvalChain.from_llm(llm=llm)

# Define examples to evaluate: questions with reference answers
examples = [
    {"query": "What is the capital of Italy?", "answer": "Rome"},
    {"query": "What color is the sky?", "answer": "Blue"},
    {"query": "2 + 2 equals?", "answer": "4"},
]
# In practice these would come from running your model on each query
predictions = [
    {"result": "Rome"},
    {"result": "Blue"},
    {"result": "4"},
]

# Run evaluation
results = evaluation_chain.evaluate(examples, predictions)
print(results)
```
Make sure the key names in your examples and predictions match the question_key, answer_key, and prediction_key you pass to evaluate().
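A small sanity check along these lines (plain Python, independent of LangChain) can catch key mismatches before the grading run:

```python
def check_keys(examples, predictions,
               question_key="query", answer_key="answer",
               prediction_key="result"):
    """Verify each example/prediction pair has the keys the evaluator expects."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must be the same length")
    for i, (ex, pred) in enumerate(zip(examples, predictions)):
        missing = [k for k in (question_key, answer_key) if k not in ex]
        if prediction_key not in pred:
            missing.append(prediction_key)
        if missing:
            raise KeyError(f"item {i} is missing keys: {missing}")
```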
Evaluation pipelines can be extended with custom metrics or prompts for more complex checks.
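A custom metric need not involve an LLM at all. For instance, a simple normalized exact-match scorer (a hedged sketch in plain Python, not part of LangChain's API) can complement the LLM-based grades:

```python
import string

def normalize(text: str) -> str:
    """Lowercase, strip whitespace, and drop punctuation for a fair comparison."""
    return text.lower().strip().translate(str.maketrans("", "", string.punctuation))

def exact_match(prediction: str, reference: str) -> bool:
    return normalize(prediction) == normalize(reference)

def score(examples, predictions) -> float:
    """Fraction of predictions that exactly match the reference answer."""
    grades = [exact_match(p["result"], e["answer"])
              for e, p in zip(examples, predictions)]
    return sum(grades) / len(grades)
```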
Use a low temperature (ideally 0) for the grading LLM so evaluation results stay consistent across runs.
Automated evaluation pipelines help test your language models quickly and reliably.
You set them up by linking inputs, model outputs, and expected answers.
They save time and improve your AI system's quality by catching errors early.