
Creating evaluation datasets in LangChain - Visual Walkthrough

Concept Flow - Creating evaluation datasets
1. Define dataset structure
2. Load raw data source
3. Process and clean data
4. Split data into train/test/eval
5. Format data for evaluation
6. Save or return evaluation dataset
This flow shows how to create an evaluation dataset by defining, loading, processing, splitting, formatting, and saving data.
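The first and last steps of the flow (defining the structure and saving the result) can be sketched in plain Python. This is a minimal sketch, not a LangChain API: `EvalExample`, `save_examples`, `load_examples`, and the file name `eval_examples.jsonl` are hypothetical names chosen for illustration, and JSON Lines is just one reasonable storage format.

```python
import json
import tempfile
from dataclasses import dataclass, asdict
from pathlib import Path


@dataclass(frozen=True)
class EvalExample:
    """One evaluation record: a question and its reference answer (hypothetical type)."""
    query: str
    expected_answer: str

    def __post_init__(self):
        # Reject blank fields early so bad rows never reach the evaluator.
        if not self.query.strip() or not self.expected_answer.strip():
            raise ValueError("query and expected_answer must be non-empty")


def save_examples(examples, path):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            f.write(json.dumps(asdict(ex)) + "\n")


def load_examples(path):
    """Read the file back into EvalExample objects, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [EvalExample(**json.loads(line)) for line in f if line.strip()]


examples = [
    EvalExample("What is AI?", "Artificial Intelligence"),
    EvalExample("Capital of France?", "Paris"),
]
path = Path(tempfile.gettempdir()) / "eval_examples.jsonl"
save_examples(examples, path)
restored = load_examples(path)
```

Fixing the structure up front means every later step (cleaning, splitting, formatting) can rely on the same two fields.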
Execution Sample
LangChain
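# Illustrative pseudocode: Dataset, from_list, split, and format_for_evaluation
# are placeholder names for this walkthrough, not a confirmed langchain.evaluation API.
from langchain.evaluation import Dataset

raw_data = load_data()  # load_data() is a user-supplied loader (not shown)
dataset = Dataset.from_list(raw_data)  # wrap the raw items in one object
train_set, test_set, eval_set = dataset.split(0.8, 0.1, 0.1)  # 80% / 10% / 10%
formatted_eval = eval_set.format_for_evaluation()  # shape the eval subset for scoring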
This code loads raw data, creates a Dataset, splits it, and formats it for evaluation.
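To run the same sequence without depending on any particular library, the sample can be backed by a small stand-in. The sketch below assumes a hypothetical `Dataset` class and `load_data()` helper written in plain Python; neither is a LangChain class, and the seeded shuffle and the rule that rounding leftovers go to the eval subset are design choices made here for illustration.

```python
import random


class Dataset:
    """Hypothetical stand-in for the walkthrough's Dataset; not a LangChain class."""

    def __init__(self, items):
        self.items = list(items)

    @classmethod
    def from_list(cls, items):
        return cls(items)

    def split(self, train, test, eval_, seed=0):
        """Shuffle deterministically, then cut into three disjoint subsets."""
        if abs(train + test + eval_ - 1.0) > 1e-9:
            raise ValueError("split ratios must sum to 1")
        shuffled = self.items[:]
        random.Random(seed).shuffle(shuffled)
        n = len(shuffled)
        a = int(n * train)
        b = a + int(n * test)
        # Any rounding remainder falls into the eval subset.
        return Dataset(shuffled[:a]), Dataset(shuffled[a:b]), Dataset(shuffled[b:])

    def format_for_evaluation(self):
        """Return plain dicts with the keys used in this walkthrough."""
        return [{"query": q, "expected_answer": a} for q, a in self.items]


def load_data():
    # Placeholder loader: twenty synthetic (question, answer) pairs.
    return [(f"What is {i} + {i}?", str(2 * i)) for i in range(20)]


raw_data = load_data()
dataset = Dataset.from_list(raw_data)
train_set, test_set, eval_set = dataset.split(0.8, 0.1, 0.1)
formatted_eval = eval_set.format_for_evaluation()
```

Using a fixed seed keeps the three subsets identical across runs, so evaluation scores stay comparable from one run to the next.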
Execution Table
| Step | Action | Input | Output | Notes |
|------|--------|-------|--------|-------|
| 1 | Call load_data() | None | List of raw data items | Raw data loaded from source |
| 2 | Create Dataset from list | Raw data list | Dataset object with all data | Dataset initialized |
| 3 | Split dataset | Dataset object | Train, Test, Eval subsets | Split ratios 80%, 10%, 10% |
| 4 | Format eval subset | Eval subset | Formatted evaluation data | Ready for evaluation use |
| 5 | Return formatted eval data | Formatted data | Evaluation dataset output | Process complete |
💡 All steps complete, evaluation dataset ready for use
Variable Tracker
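| Variable | Start | After Step 1 | After Step 2 | After Step 3 | After Step 4 | Final |
|----------|-------|--------------|--------------|--------------|--------------|-------|
| raw_data | None | List of raw items | List of raw items | List of raw items | List of raw items | List of raw items |
| dataset | None | None | Dataset object | Dataset object | Dataset object | Dataset object |
| train_set | None | None | None | Train subset | Train subset | Train subset |
| test_set | None | None | None | Test subset | Test subset | Test subset |
| eval_set | None | None | None | Eval subset | Eval subset | Eval subset |
| formatted_eval | None | None | None | None | Formatted eval data | Formatted eval data |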
Key Moments - 3 Insights
Why do we split the dataset into train, test, and eval parts?
Splitting ensures we train on one part, test on another, and evaluate on a separate set to fairly measure performance, as shown in step 3 of the execution_table.
What does formatting the evaluation data do?
Formatting prepares the data in a way the evaluation tools expect, making it usable for scoring or comparison, as seen in step 4.
Can we create an evaluation dataset without cleaning or processing raw data?
Skipping processing may cause errors or poor evaluation quality. Processing ensures data is consistent and clean before splitting and formatting.
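A cleaning pass of the kind described above might look like the following. This is a sketch under assumptions: `clean_examples` is a hypothetical helper, and the specific rules (trim whitespace, drop rows with a missing field, treat questions that differ only in case as duplicates) are illustrative choices rather than anything the walkthrough prescribes.

```python
def clean_examples(rows):
    """Trim whitespace, drop incomplete rows, and remove duplicate questions."""
    seen = set()
    cleaned = []
    for row in rows:
        query = (row.get("query") or "").strip()
        answer = (row.get("expected_answer") or "").strip()
        if not query or not answer:
            continue  # incomplete row: nothing to score against
        key = query.casefold()
        if key in seen:
            continue  # same question already kept
        seen.add(key)
        cleaned.append({"query": query, "expected_answer": answer})
    return cleaned


raw_rows = [
    {"query": "  What is AI? ", "expected_answer": "Artificial Intelligence"},
    {"query": "what is ai?", "expected_answer": "Artificial Intelligence"},
    {"query": "Capital of France?", "expected_answer": ""},
    {"query": "Who wrote Hamlet?", "expected_answer": " Shakespeare "},
]
cleaned_rows = clean_examples(raw_rows)
```

Deduplicating before the split matters: a question that appears twice could otherwise land in both the training and evaluation subsets.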
Visual Quiz - 3 Questions
Test your understanding
Look at the execution_table, what is the output after step 3?
A. A single Dataset object with all data
B. Train, Test, Eval subsets
C. Formatted evaluation data
D. Raw data list
💡 Hint
Check the Output column for step 3 in the execution_table
According to variable_tracker, when does 'formatted_eval' get its value?
A. After Step 2
B. After Step 3
C. After Step 4
D. At Start
💡 Hint
Look at the 'formatted_eval' row and see when it changes from None
If we skip splitting the dataset, how would the execution_table change?
A. Step 3 would be missing or output the full dataset
B. Step 4 would not exist
C. Step 3 would output formatted evaluation data
D. Step 1 would fail
💡 Hint
Consider what splitting does at step 3 and what happens if it's skipped
Concept Snapshot
Creating evaluation datasets in LangChain:
1. Load raw data
2. Create Dataset object
3. Split into train/test/eval
4. Format eval subset
5. Use formatted data for evaluation
Splitting ensures fair testing and evaluation.
Full Transcript
Creating evaluation datasets involves loading raw data, wrapping it in a Dataset object, splitting it into training, testing, and evaluation parts, then formatting the evaluation subset for use. This process helps measure model performance fairly by separating data for training and evaluation. The key steps include loading data, splitting with defined ratios, and formatting for evaluation tools. Variables like raw_data, dataset, and formatted_eval change state as the process moves forward. Understanding why splitting and formatting happen helps avoid confusion and ensures good evaluation results.

Practice

1. What is the main purpose of creating evaluation datasets in LangChain?
easy
A. To speed up the language model's response time
B. To train the language model with more data
C. To test how well the language model answers specific questions
D. To store user conversations permanently

Solution

  1. Step 1: Understand evaluation datasets

    Evaluation datasets contain example questions and expected answers to check model accuracy.
  2. Step 2: Identify the purpose in LangChain context

    They are used to test how well the model answers, not for training or storage.
  3. Final Answer:

    To test how well the language model answers specific questions -> Option C
  4. Quick Check:

    Evaluation datasets = test model accuracy [OK]
Hint: Evaluation datasets check the model's answers; they do not train it [OK]
Common Mistakes:
  • Confusing evaluation datasets with training data
  • Thinking evaluation datasets speed up the model
  • Assuming evaluation datasets store user data
2. Which of the following is the correct way to create an evaluation example in LangChain?
easy
A. example = ("What is AI?", "Artificial Intelligence")
B. example = "What is AI? -> Artificial Intelligence"
C. example = ["What is AI?", "Artificial Intelligence"]
D. example = {"query": "What is AI?", "expected_answer": "Artificial Intelligence"}

Solution

  1. Step 1: Recall LangChain evaluation example format

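    Evaluation examples are dictionaries that label each field. This walkthrough uses the keys 'query' and 'expected_answer'. (QAEvalChain itself defaults to 'query' and 'answer' and accepts other key names through its question_key and answer_key parameters.)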
  2. Step 2: Match the correct syntax

    example = {"query": "What is AI?", "expected_answer": "Artificial Intelligence"} uses a dictionary with named keys. The other options use a tuple, a list, or a plain string, none of which labels its fields.
  3. Final Answer:

    example = {"query": "What is AI?", "expected_answer": "Artificial Intelligence"} -> Option D
  4. Quick Check:

    Evaluation example = dictionary with keys [OK]
Hint: Use dictionary with 'query' and 'expected_answer' keys [OK]
Common Mistakes:
  • Using tuples or lists instead of dictionaries
  • Not using correct keys 'query' and 'expected_answer'
  • Using plain strings without structure
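The mistakes listed above can be caught before any evaluation runs. The following is a minimal sketch, assuming the 'query' / 'expected_answer' convention used in this walkthrough; `REQUIRED_KEYS` and `validate_examples` are hypothetical names, not part of LangChain.

```python
REQUIRED_KEYS = ("query", "expected_answer")


def validate_examples(examples):
    """Return a list of problems; an empty list means the dataset is well formed."""
    problems = []
    for i, ex in enumerate(examples):
        if not isinstance(ex, dict):
            problems.append(f"example {i}: expected dict, got {type(ex).__name__}")
            continue
        for key in REQUIRED_KEYS:
            # A key counts as present only if it holds a non-blank string.
            if not isinstance(ex.get(key), str) or not ex[key].strip():
                problems.append(f"example {i}: missing or empty '{key}'")
    return problems


candidates = [
    {"query": "What is AI?", "expected_answer": "Artificial Intelligence"},
    ("What is AI?", "Artificial Intelligence"),
    {"query": "What is AI?", "answer": "Artificial Intelligence"},
]
issues = validate_examples(candidates)
```

Collecting every problem in one pass, instead of raising on the first, makes it easier to repair a large dataset in a single edit.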
3. Given the following code snippet, what will be the output?
from langchain.evaluation.qa import QAEvalChain
examples = [{"query": "Capital of France?", "expected_answer": "Paris"}]
chain = QAEvalChain.from_llm(llm=None)
results = chain.evaluate(examples)
print(results)
medium
A. TypeError because llm=None is invalid
B. SyntaxError due to missing import
C. Empty list [] because no LLM provided
D. [{'query': 'Capital of France?', 'expected_answer': 'Paris', 'result': 'correct'}]

Solution

  1. Step 1: Analyze the QAEvalChain initialization

    The method from_llm requires a valid language model instance, not None.
  2. Step 2: Predict the error from invalid llm argument

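    Passing None fails before any example is scored, because the chain cannot be built or run without a model. The quiz labels this a TypeError; the exact exception class can vary by LangChain version and may surface as a validation error when the chain is constructed.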
  3. Final Answer:

    TypeError because llm=None is invalid -> Option A
  4. Quick Check:

    Invalid llm argument = TypeError [OK]
Hint: QAEvalChain needs a valid LLM, None causes error [OK]
Common Mistakes:
  • Assuming None is a valid LLM
  • Expecting output without running the model
  • Ignoring required imports or parameters
4. You wrote this code to create evaluation examples but get an error:
examples = [{"query": "Who wrote Hamlet?", "answer": "Shakespeare"}]
chain = QAEvalChain.from_llm(llm=some_llm)
results = chain.evaluate(examples)
print(results)
What is the likely cause of the error?
medium
A. The variable some_llm is not defined
B. The key 'answer' should be 'expected_answer' in the example dictionary
C. QAEvalChain does not have an evaluate method
D. The examples list should be empty

Solution

  1. Step 1: Check example dictionary keys

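    The evaluation in this exercise reads the reference answer from the 'expected_answer' key, so an example that stores it under 'answer' does not match. (QAEvalChain.evaluate looks up the key named by its answer_key parameter, whose default is 'answer'; the key in the data and the key the chain is told to read must agree.)
  2. Step 2: Identify mismatch causing error

    With the reference answer stored under a key the evaluation does not read, the lookup of the expected answer fails.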
  3. Final Answer:

    The key 'answer' should be 'expected_answer' in the example dictionary -> Option B
  4. Quick Check:

    Correct key name = 'expected_answer' [OK]
Hint: Use 'expected_answer' key, not 'answer' in examples [OK]
Common Mistakes:
  • Using wrong key names in example dictionaries
  • Assuming method names without checking docs
  • Ignoring variable definitions
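When a dataset arrives with a different key name, one option is to rename the key before evaluating. This is a small sketch with a hypothetical helper, `rename_key`; an alternative, where the evaluator allows it (QAEvalChain.evaluate has an answer_key parameter), is to leave the data unchanged and pass the existing key name instead.

```python
def rename_key(examples, old, new):
    """Return copies of the examples with key `old` renamed to `new`."""
    renamed = []
    for ex in examples:
        ex = dict(ex)  # copy so the caller's data is untouched
        if old in ex and new not in ex:
            ex[new] = ex.pop(old)
        renamed.append(ex)
    return renamed


examples = [{"query": "Who wrote Hamlet?", "answer": "Shakespeare"}]
fixed = rename_key(examples, "answer", "expected_answer")
```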
5. You want to create an evaluation dataset with multiple examples and run QAEvalChain to check model accuracy. Which approach correctly prepares and evaluates the dataset?
hard
A. Prepare a list of dictionaries with 'query' and 'expected_answer', then call chain.evaluate(examples)
B. Prepare a list of tuples (query, expected_answer), then call chain.run(examples)
C. Prepare a dictionary with queries as keys and answers as values, then call chain.evaluate(examples)
D. Prepare a list of strings with 'query: answer' format, then call chain.run(examples)

Solution

  1. Step 1: Format evaluation dataset correctly

    This walkthrough formats the dataset as a list of dictionaries, each with the keys 'query' and 'expected_answer'.
  2. Step 2: Use the correct method to evaluate

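    QAEvalChain's evaluate() method scores a whole list of examples in one call. In LangChain releases that ship QAEvalChain, evaluate() also takes the matching list of model predictions as its second argument.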
  3. Final Answer:

    Prepare a list of dictionaries with 'query' and 'expected_answer', then call chain.evaluate(examples) -> Option A
  4. Quick Check:

    List of dicts + evaluate() = correct approach [OK]
Hint: Use list of dicts with evaluate() method for multiple examples [OK]
Common Mistakes:
  • Using tuples or dicts with wrong structure
  • Calling run() instead of evaluate() for batch evaluation
  • Passing strings instead of structured data
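QAEvalChain grades answers with an LLM, so it cannot be demonstrated without a model. The batch shape of the evaluation (a list of example dicts paired with a list of predictions, one graded result per pair) can still be sketched with a plain exact-match grader. Everything here is an assumption for illustration: `normalize` and `evaluate_exact_match` are hypothetical helpers, exact matching is a much cruder check than LLM grading, and the 'result' key simply mirrors QAEvalChain's default prediction key.

```python
def normalize(text):
    """Lowercase, trim, and drop a trailing period for lenient matching."""
    return text.strip().rstrip(".").casefold()


def evaluate_exact_match(examples, predictions):
    """Grade each prediction against its example; return one result dict per pair."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    results = []
    for ex, pred in zip(examples, predictions):
        correct = normalize(pred["result"]) == normalize(ex["expected_answer"])
        results.append(
            {"query": ex["query"], "grade": "CORRECT" if correct else "INCORRECT"}
        )
    return results


examples = [
    {"query": "Capital of France?", "expected_answer": "Paris"},
    {"query": "Who wrote Hamlet?", "expected_answer": "Shakespeare"},
]
predictions = [{"result": "paris."}, {"result": "Christopher Marlowe"}]
results = evaluate_exact_match(examples, predictions)
accuracy = sum(r["grade"] == "CORRECT" for r in results) / len(results)
```

Keeping examples and predictions as two parallel lists makes it possible to re-grade the same predictions with a different evaluator without calling the model again.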