Creating evaluation datasets in LangChain - Practice Exercises

Challenge - 5 Problems
component_behavior
intermediate
What is the output of this LangChain evaluation dataset creation code?
Consider this Python snippet, which builds an evaluation dataset from three examples. Assuming load_evaluation_dataset returns the examples as a list-like dataset, what does print(len(eval_dataset)) output?
from langchain.evaluation.loading import load_evaluation_dataset

# Load a dataset with 3 examples
eval_dataset = load_evaluation_dataset(
    name="custom_dataset",
    data=[
        {"input": "Hello", "output": "Hi"},
        {"input": "Bye", "output": "Goodbye"},
        {"input": "Thanks", "output": "You are welcome"}
    ]
)

print(len(eval_dataset))
A. 1
B. 0
C. 3
D. Raises a TypeError
💡 Hint
Think about how many items are in the data list passed to load_evaluation_dataset.
📝 Syntax
intermediate
Which option correctly creates an evaluation dataset with LangChain from a JSON file?
You want to create an evaluation dataset by loading data from a JSON file named data.json. Which code snippet correctly does this using LangChain?
A.
from langchain.evaluation.loading import load_evaluation_dataset

eval_dataset = load_evaluation_dataset(file="data.json")
B.
from langchain.evaluation.loading import load_evaluation_dataset

eval_dataset = load_evaluation_dataset(path="data.json")
C.
from langchain.evaluation.loading import load_evaluation_dataset

eval_dataset = load_evaluation_dataset(name="json")
D.
from langchain.evaluation.loading import load_evaluation_dataset

eval_dataset = load_evaluation_dataset(name="json", path="data.json")
💡 Hint
The function requires the dataset type name and the path to the file.
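As a library-independent sketch of the same idea, the examples can be read from a JSON file with Python's standard json module. The file layout (a top-level list of objects with "input" and "output" keys) and the helper name load_examples_from_json are assumptions made for illustration; they are not part of LangChain's API.

```python
import json
import tempfile
from pathlib import Path


def load_examples_from_json(path):
    """Hypothetical helper: read a JSON file holding a list of example dicts."""
    with open(path, encoding="utf-8") as f:
        examples = json.load(f)
    if not isinstance(examples, list):
        raise TypeError("expected a top-level JSON list of examples")
    return examples


# Write a small data.json into a temporary directory so the sketch runs anywhere.
sample = [
    {"input": "Hello", "output": "Hi"},
    {"input": "Bye", "output": "Goodbye"},
]
with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "data.json"
    json_path.write_text(json.dumps(sample), encoding="utf-8")
    loaded_examples = load_examples_from_json(json_path)

print(len(loaded_examples))  # 2
```

Checking the top-level type at load time surfaces a malformed file immediately rather than later, in the middle of an evaluation run.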
state_output
advanced
What is the value of dataset[1]['output'] after this code runs?
Given this code creating an evaluation dataset, what is the value of dataset[1]['output']?
from langchain.evaluation.loading import load_evaluation_dataset

dataset = load_evaluation_dataset(
    name="custom_dataset",
    data=[
        {"input": "Q1", "output": "A1"},
        {"input": "Q2", "output": "A2"},
        {"input": "Q3", "output": "A3"}
    ]
)

result = dataset[1]['output']
A. "A2"
B. "Q2"
C. "A3"
D. Raises an IndexError
💡 Hint
Remember Python lists are zero-indexed.
🔧 Debug
advanced
Why does this code raise a TypeError when creating an evaluation dataset?
This code snippet raises a TypeError. What is the cause?
from langchain.evaluation.loading import load_evaluation_dataset

data = {"input": "Hello", "output": "Hi"}
eval_dataset = load_evaluation_dataset(name="custom_dataset", data=data)
A. The name parameter is invalid and causes the error
B. The data parameter must be a list of dictionaries, not a single dictionary
C. load_evaluation_dataset does not accept a data parameter
D. The import statement is incorrect
💡 Hint
Check the type of the data argument passed.
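The type check this question turns on can be sketched in plain Python. The validate_examples helper below is hypothetical and only illustrates the "list of dicts, not a single dict" rule; it does not reproduce how any LangChain loader validates its input.

```python
def validate_examples(data):
    """Hypothetical check: data must be a list of dicts, one dict per example."""
    if isinstance(data, dict):
        raise TypeError("data must be a list of dicts; wrap a single example in a list")
    if not isinstance(data, list) or not all(isinstance(item, dict) for item in data):
        raise TypeError("data must be a list of dicts")
    return data


single = {"input": "Hello", "output": "Hi"}

# Passing the bare dict is rejected with a TypeError.
try:
    validate_examples(single)
except TypeError as exc:
    error_message = str(exc)

# Wrapping the single example in a list satisfies the check.
valid = validate_examples([single])
print(error_message)
print(len(valid))  # 1
```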
🧠 Conceptual
expert
Which option best describes the purpose of creating evaluation datasets in LangChain?
Why do developers create evaluation datasets when working with LangChain?
A. To test and measure the performance of language models on specific tasks
B. To speed up the training process of language models
C. To generate new training data automatically
D. To deploy language models to production environments
💡 Hint
Evaluation datasets help check how well something works.

Practice

1. What is the main purpose of creating evaluation datasets in LangChain?
easy
A. To speed up the language model's response time
B. To train the language model with more data
C. To test how well the language model answers specific questions
D. To store user conversations permanently

Solution

  1. Step 1: Understand evaluation datasets

    Evaluation datasets contain example questions and expected answers to check model accuracy.
  2. Step 2: Identify the purpose in LangChain context

    They are used to test how well the model answers, not for training or storage.
  3. Final Answer:

    To test how well the language model answers specific questions -> Option C
  4. Quick Check:

    Evaluation datasets = test model accuracy [OK]
Hint: Evaluation datasets check the model's answers; they do not train it [OK]
Common Mistakes:
  • Confusing evaluation datasets with training data
  • Thinking evaluation datasets speed up the model
  • Assuming evaluation datasets store user data
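To make "check model accuracy" concrete, here is a minimal sketch that scores a stand-in model against a small evaluation dataset using case-insensitive exact match. The fake_model function and the exact_match_accuracy helper are assumptions for illustration only; a real setup would call an actual language model and would often use a more forgiving grader.

```python
def exact_match_accuracy(examples, answer_fn):
    """Fraction of examples where answer_fn(query) equals the expected answer."""
    if not examples:
        return 0.0
    correct = sum(
        1
        for ex in examples
        if answer_fn(ex["query"]).strip().lower() == ex["expected_answer"].strip().lower()
    )
    return correct / len(examples)


def fake_model(query):
    """Stand-in for a real language model call (assumption for this sketch)."""
    canned = {"Capital of France?": "Paris", "Who wrote Hamlet?": "Marlowe"}
    return canned.get(query, "")


examples = [
    {"query": "Capital of France?", "expected_answer": "Paris"},
    {"query": "Who wrote Hamlet?", "expected_answer": "Shakespeare"},
]

accuracy = exact_match_accuracy(examples, fake_model)
print(accuracy)  # 0.5
```

The dataset is only read, never used to update the model, which is the distinction between evaluation data and training data.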
2. Which of the following is the correct way to create an evaluation example in LangChain?
easy
A. example = ("What is AI?", "Artificial Intelligence")
B. example = "What is AI? -> Artificial Intelligence"
C. example = ["What is AI?", "Artificial Intelligence"]
D. example = {"query": "What is AI?", "expected_answer": "Artificial Intelligence"}

Solution

  1. Step 1: Recall LangChain evaluation example format

    Evaluation examples are dictionaries with keys like 'query' and 'expected_answer'.
  2. Step 2: Match the correct syntax

    example = {"query": "What is AI?", "expected_answer": "Artificial Intelligence"} uses a dictionary with proper keys, others use tuples, lists, or strings incorrectly.
  3. Final Answer:

    example = {"query": "What is AI?", "expected_answer": "Artificial Intelligence"} -> Option D
  4. Quick Check:

    Evaluation example = dictionary with keys [OK]
Hint: Use dictionary with 'query' and 'expected_answer' keys [OK]
Common Mistakes:
  • Using tuples or lists instead of dictionaries
  • Not using correct keys 'query' and 'expected_answer'
  • Using plain strings without structure
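If the data starts out as tuples or lists, it can be converted to the dictionary format described above. The sketch below uses a TypedDict named EvalExample and a helper named pairs_to_examples; both names are hypothetical and chosen only for this illustration.

```python
from typing import TypedDict


class EvalExample(TypedDict):
    """One evaluation example: a question and its reference answer."""
    query: str
    expected_answer: str


def pairs_to_examples(pairs):
    """Convert (query, expected_answer) pairs into a list of example dicts."""
    return [EvalExample(query=q, expected_answer=a) for q, a in pairs]


pairs = [
    ("What is AI?", "Artificial Intelligence"),
    ("What is ML?", "Machine Learning"),
]
qa_examples = pairs_to_examples(pairs)
print(qa_examples[0]["expected_answer"])  # Artificial Intelligence
```

At runtime a TypedDict instance is an ordinary dict, so the result can be passed anywhere a list of plain dictionaries is accepted.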
3. Given the following code snippet, what will be the output?
from langchain.evaluation.qa import QAEvalChain
examples = [{"query": "Capital of France?", "expected_answer": "Paris"}]
chain = QAEvalChain.from_llm(llm=None)
results = chain.evaluate(examples)
print(results)
medium
A. TypeError because llm=None is invalid
B. SyntaxError due to missing import
C. Empty list [] because no LLM provided
D. [{'query': 'Capital of France?', 'expected_answer': 'Paris', 'result': 'correct'}]

Solution

  1. Step 1: Analyze the QAEvalChain initialization

    The method from_llm requires a valid language model instance, not None.
  2. Step 2: Predict the error from invalid llm argument

    Passing None will cause a TypeError or similar because the chain cannot run without a model.
  3. Final Answer:

    TypeError because llm=None is invalid -> Option A
  4. Quick Check:

    Invalid llm argument = TypeError [OK]
Hint: QAEvalChain needs a valid LLM, None causes error [OK]
Common Mistakes:
  • Assuming None is a valid LLM
  • Expecting output without running the model
  • Ignoring required imports or parameters
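The fail-fast idea behind this answer can be sketched without LangChain. SimpleEvaluator below is a hypothetical class, not QAEvalChain, and the exact exception type LangChain raises for llm=None is not confirmed here; the sketch only shows why an evaluator should reject a missing model up front.

```python
class SimpleEvaluator:
    """Hypothetical evaluator that needs a callable model to grade answers."""

    def __init__(self, llm):
        # Reject a missing or non-callable model before any evaluation starts.
        if llm is None or not callable(llm):
            raise TypeError("llm must be a callable model, got " + type(llm).__name__)
        self.llm = llm

    @classmethod
    def from_llm(cls, llm):
        return cls(llm)


try:
    SimpleEvaluator.from_llm(llm=None)
except TypeError as exc:
    guard_error = str(exc)

print(guard_error)  # llm must be a callable model, got NoneType
```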
4. You wrote this code to create evaluation examples but get an error:
examples = [{"query": "Who wrote Hamlet?", "answer": "Shakespeare"}]
chain = QAEvalChain.from_llm(llm=some_llm)
results = chain.evaluate(examples)
print(results)
What is the likely cause of the error?
medium
A. The variable some_llm is not defined
B. The key 'answer' should be 'expected_answer' in the example dictionary
C. QAEvalChain does not have an evaluate method
D. The examples list should be empty

Solution

  1. Step 1: Check example dictionary keys

    The examples in this exercise store the reference answer under the 'expected_answer' key; this example uses 'answer' instead.
  2. Step 2: Identify mismatch causing error

    Using 'answer' instead of 'expected_answer' causes the chain to fail reading expected answers.
  3. Final Answer:

    The key 'answer' should be 'expected_answer' in the example dictionary -> Option B
  4. Quick Check:

    Correct key name = 'expected_answer' [OK]
Hint: Use 'expected_answer' key, not 'answer' in examples [OK]
Common Mistakes:
  • Using wrong key names in example dictionaries
  • Assuming method names without checking docs
  • Ignoring variable definitions
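A key mismatch like this can be caught before evaluation runs. The two helpers below, missing_keys and rename_key, are hypothetical; they assume the evaluator reads 'query' and 'expected_answer', as in this exercise. Which key names a given evaluator reads depends on how it is configured, so check that before renaming anything.

```python
def rename_key(examples, old, new):
    """Return copies of the examples with key `old` renamed to `new`."""
    renamed = []
    for ex in examples:
        ex = dict(ex)  # copy so the caller's data is not mutated
        if old in ex and new not in ex:
            ex[new] = ex.pop(old)
        renamed.append(ex)
    return renamed


def missing_keys(examples, required=("query", "expected_answer")):
    """List (index, key) pairs for every required key an example lacks."""
    return [(i, key) for i, ex in enumerate(examples) for key in required if key not in ex]


raw = [{"query": "Who wrote Hamlet?", "answer": "Shakespeare"}]
print(missing_keys(raw))  # [(0, 'expected_answer')]

fixed = rename_key(raw, "answer", "expected_answer")
print(missing_keys(fixed))  # []
```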
5. You want to create an evaluation dataset with multiple examples and run QAEvalChain to check model accuracy. Which approach correctly prepares and evaluates the dataset?
hard
A. Prepare a list of dictionaries with 'query' and 'expected_answer', then call chain.evaluate(examples)
B. Prepare a list of tuples (query, expected_answer), then call chain.run(examples)
C. Prepare a dictionary with queries as keys and answers as values, then call chain.evaluate(examples)
D. Prepare a list of strings with 'query: answer' format, then call chain.run(examples)

Solution

  1. Step 1: Format evaluation dataset correctly

    LangChain expects a list of dictionaries with keys 'query' and 'expected_answer' for evaluation.
  2. Step 2: Use the correct method to evaluate

    The QAEvalChain uses the evaluate() method to process multiple examples at once.
  3. Final Answer:

    Prepare a list of dictionaries with 'query' and 'expected_answer', then call chain.evaluate(examples) -> Option A
  4. Quick Check:

    List of dicts + evaluate() = correct approach [OK]
Hint: Use list of dicts with evaluate() method for multiple examples [OK]
Common Mistakes:
  • Using tuples or dicts with wrong structure
  • Calling run() instead of evaluate() for batch evaluation
  • Passing strings instead of structured data
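The "list of dicts, graded in one batch call" approach can be sketched in plain Python. The grade_batch function is a hypothetical stand-in for a batch evaluator: it takes the model's predictions as a separate list and uses case-insensitive exact match, whereas an LLM-based grader would judge each answer with a model instead.

```python
def grade_batch(examples, predictions):
    """Pair each example with its prediction and label it CORRECT or INCORRECT."""
    if len(examples) != len(predictions):
        raise ValueError("examples and predictions must have the same length")
    graded = []
    for ex, pred in zip(examples, predictions):
        is_match = pred.strip().lower() == ex["expected_answer"].strip().lower()
        graded.append(
            {**ex, "prediction": pred, "result": "CORRECT" if is_match else "INCORRECT"}
        )
    return graded


batch = [
    {"query": "Capital of France?", "expected_answer": "Paris"},
    {"query": "Who wrote Hamlet?", "expected_answer": "Shakespeare"},
]
preds = ["paris", "Marlowe"]

graded = grade_batch(batch, preds)
print([g["result"] for g in graded])  # ['CORRECT', 'INCORRECT']
```

Keeping each example as a dictionary lets the grader attach the prediction and verdict to the original record, which is harder to do cleanly with tuples or "query: answer" strings.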