Automated evaluation pipelines in LangChain - Practice Problems & Coding Challenges

Challenge - 5 Problems
component_behavior
intermediate
What is the output of this LangChain evaluation pipeline snippet?

Consider this evaluation pipeline code, which applies a simple evaluation function to a list of inputs. Assume that eval_chain.evaluate(doc) returns the result of calling evaluation_fn on doc.

from langchain.evaluation import EvaluationChain
from langchain.schema import Document

def simple_eval_fn(doc: Document) -> bool:
    return "good" in doc.page_content

eval_chain = EvaluationChain.from_llm(llm=None, evaluation_fn=simple_eval_fn)
inputs = [Document(page_content="This is a good example."), Document(page_content="This is bad.")]
results = [eval_chain.evaluate(input_doc) for input_doc in inputs]
print(results)

What will be printed?

A. TypeError: 'NoneType' object is not callable
B. [False, True]
C. ["good", "bad"]
D. [True, False]
💡 Hint

Check what the evaluation function returns for each document.

📝 Syntax
intermediate
Which option correctly creates a LangChain evaluation pipeline with a custom metric?

You want to create an evaluation pipeline that uses a custom metric function returning a float score. Which code snippet is syntactically correct and returns a float?

A. eval_chain = EvaluationChain.from_llm(llm=my_llm, evaluation_fn=lambda doc: 0.9)
B. eval_chain = EvaluationChain.from_llm(llm=my_llm, evaluation_fn=lambda doc: return 0.9)
C. eval_chain = EvaluationChain.from_llm(llm=my_llm, evaluation_fn=lambda doc: {0.9})
D. eval_chain = EvaluationChain.from_llm(llm=my_llm, evaluation_fn=lambda doc: [0.9])
💡 Hint

Remember how lambda functions return values in Python.

🔧 Debug
advanced
Why does this LangChain evaluation pipeline raise an AttributeError?

Given this code snippet:

from langchain.evaluation import EvaluationChain

class MyEval:
    def __call__(self, doc):
        return len(doc.page_content)

eval_chain = EvaluationChain.from_llm(llm=None, evaluation_fn=MyEval())
result = eval_chain.evaluate("Test input")
print(result)

Why does it raise an AttributeError?

A. 'MyEval' object is not callable because __call__ is missing
B. 'str' object has no attribute 'page_content' because evaluate was called with a string, not a Document
C. TypeError because llm=None is invalid
D. NameError because EvaluationChain is not imported
💡 Hint

Check the type of the argument passed to the evaluation function.

state_output
advanced
What is the value of 'scores' after running this LangChain evaluation pipeline?

Consider this code:

from langchain.evaluation import EvaluationChain
from langchain.schema import Document

scores = []
def eval_fn(doc: Document) -> float:
    score = len(doc.page_content) / 10
    scores.append(score)
    return score

eval_chain = EvaluationChain.from_llm(llm=None, evaluation_fn=eval_fn)
inputs = [Document(page_content="Hello world!"), Document(page_content="LangChain")]
results = [eval_chain.evaluate(doc) for doc in inputs]

What is the final value of the list scores?

A. [11, 9]
B. [12, 9]
C. [1.2, 0.9]
D. [1.1, 0.9]
💡 Hint

Count the characters in each string and divide by 10.

🧠 Conceptual
expert
Which option best describes the role of 'evaluation_fn' in LangChain's EvaluationChain?

In LangChain's EvaluationChain, what is the primary purpose of the evaluation_fn parameter?

A. It defines a function that takes a Document and returns a metric or boolean indicating evaluation results
B. It specifies the language model used to generate text outputs
C. It configures the logging level for the evaluation process
D. It initializes the database connection for storing evaluation data
💡 Hint

Think about what evaluation means in this context.
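
The idea of an evaluation function can be illustrated without any framework. The sketch below is plain Python, not LangChain's API: SimpleDoc and run_eval are hypothetical stand-ins, and it assumes only that a pipeline calls the function once per document and collects whatever it returns.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

Score = Union[bool, float]


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a document with a page_content field."""
    page_content: str


def run_eval(evaluation_fn: Callable[[SimpleDoc], Score],
             docs: List[SimpleDoc]) -> List[Score]:
    """Call the evaluation function once per document and collect the results."""
    return [evaluation_fn(doc) for doc in docs]


docs = [SimpleDoc("The answer is clear."), SimpleDoc("Unsure")]

# A boolean check: does the text end with a period?
ends_with_period = run_eval(lambda d: d.page_content.endswith("."), docs)

# A numeric metric: the number of whitespace-separated words, as a float.
word_counts = run_eval(lambda d: float(len(d.page_content.split())), docs)

print(ends_with_period)
print(word_counts)
```

The same runner handles both a boolean check and a numeric metric, since the evaluation function alone decides what kind of result comes back.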

Practice

1. What is the main purpose of an automated evaluation pipeline in LangChain?
easy
A. To quickly test language model outputs against expected answers
B. To train new language models from scratch
C. To manually review each model output for quality
D. To deploy language models to production servers

Solution

  1. Step 1: Understand the role of evaluation pipelines

    Evaluation pipelines automatically compare model outputs to expected answers to check correctness.
  2. Step 2: Identify the main benefit

    This automation speeds up testing and helps catch errors early without manual review.
  3. Final Answer:

    To quickly test language model outputs against expected answers -> Option A
  4. Quick Check:

    Automated testing = Quick evaluation [OK]
Hint: Evaluation pipelines compare outputs to expected answers fast [OK]
Common Mistakes:
  • Confusing evaluation with training
  • Thinking evaluation is manual
  • Assuming deployment is part of evaluation
2. Which of the following is the correct way to create an evaluation pipeline in LangChain?
easy
A. pipeline = EvaluationPipeline(inputs, model, expected_outputs)
B. pipeline = EvaluationPipeline(model, inputs, expected_outputs)
C. pipeline = EvaluationPipeline(expected_outputs, inputs, model)
D. pipeline = EvaluationPipeline(inputs, expected_outputs, model)

Solution

  1. Step 1: Recall the order of parameters

    The EvaluationPipeline constructor expects inputs first, then the model, then expected outputs.
  2. Step 2: Match the correct parameter order

    pipeline = EvaluationPipeline(inputs, model, expected_outputs) matches this order exactly. The other options put the arguments in the wrong sequence, which causes errors.
  3. Final Answer:

    pipeline = EvaluationPipeline(inputs, model, expected_outputs) -> Option A
  4. Quick Check:

    Inputs, model, expected outputs order [OK]
Hint: Remember: inputs first, then model, then expected outputs [OK]
Common Mistakes:
  • Swapping model and inputs order
  • Putting expected outputs before inputs
  • Using wrong parameter sequence causing errors
3. Given this code snippet, what will be the output of results?
inputs = ["Hello", "World"]
model = lambda x: x.lower()
expected = ["hello", "world"]
pipeline = EvaluationPipeline(inputs, model, expected)
results = pipeline.run()
medium
A. [True, False]
B. [False, False]
C. [True, True]
D. RuntimeError

Solution

  1. Step 1: Understand the model function

    The model converts each input string to lowercase, so "Hello" -> "hello" and "World" -> "world".
  2. Step 2: Compare model outputs to expected

    Both outputs match the expected list exactly, so evaluation returns True for both.
  3. Final Answer:

    [True, True] -> Option C
  4. Quick Check:

    Lowercase matches expected = True [OK]
Hint: Check if model output matches expected exactly [OK]
Common Mistakes:
  • Assuming case does not matter
  • Expecting runtime error from lambda
  • Mixing up True and False results
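
To make the reasoning above concrete, here is a minimal stand-in for the EvaluationPipeline used in the question. This class is an assumption written only to mirror the behavior the solution describes (call the model on each input and compare with the expected value); it is not taken from the LangChain library.

```python
from typing import Any, Callable, List


class EvaluationPipeline:
    """Hypothetical stand-in that mirrors the behavior assumed in the question."""

    def __init__(self, inputs: List[Any], model: Callable[[Any], Any],
                 expected: List[Any]) -> None:
        self.inputs = inputs
        self.model = model
        self.expected = expected

    def run(self) -> List[bool]:
        # Exact-match comparison between each model output and its expected value.
        return [self.model(x) == y for x, y in zip(self.inputs, self.expected)]


inputs = ["Hello", "World"]
model = lambda x: x.lower()
expected = ["hello", "world"]

results = EvaluationPipeline(inputs, model, expected).run()
print(results)  # [True, True]
```

Under this exact-match assumption, changing expected to ["Hello", "world"] would give [False, True], which is why case matters here.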
4. You wrote this evaluation pipeline but it raises an error:
inputs = ["Test"]
model = "not a function"
expected = ["test"]
pipeline = EvaluationPipeline(inputs, model, expected)
pipeline.run()
What is the likely cause?
medium
A. Inputs list cannot have only one item
B. Model must be a callable function, not a string
C. Expected outputs must be integers
D. EvaluationPipeline requires three arguments, but only two were given

Solution

  1. Step 1: Check the model parameter type

    The model should be a function that processes inputs, but here it is a string, which is not callable.
  2. Step 2: Understand the error cause

    Calling pipeline.run() tries to call the model on inputs, causing a TypeError because strings can't be called like functions.
  3. Final Answer:

    Model must be a callable function, not a string -> Option B
  4. Quick Check:

    Model callable required, string given [OK]
Hint: Model must be a function, not a string [OK]
Common Mistakes:
  • Thinking inputs size causes error
  • Expecting output type to be integer
  • Miscounting constructor arguments
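
The failure can be reproduced in plain Python, independent of any pipeline class. The sketch below also shows one possible guard using the built-in callable(); check_model is a hypothetical helper name, not a library function.

```python
model = "not a function"

error_message = ""
try:
    model("Test")  # calling a str raises TypeError
except TypeError as exc:
    error_message = str(exc)

print(error_message)


def check_model(candidate):
    """Fail early with a clear message if the model cannot be called."""
    if not callable(candidate):
        raise TypeError(
            f"model must be callable, got {type(candidate).__name__}"
        )
    return candidate


# A real function passes the check unchanged.
valid_model = check_model(str.lower)
```

Validating the model when the pipeline is constructed surfaces the mistake at the line that caused it, rather than later inside run().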
5. You want to evaluate a language model that sometimes returns empty strings for some inputs. How should you modify your automated evaluation pipeline to handle this edge case correctly?
hard
A. Replace empty string outputs with None before evaluation
B. Treat empty string outputs as incorrect regardless of expected answer
C. Ignore inputs that produce empty strings in the evaluation
D. Filter out empty string outputs before comparing to expected answers

Solution

  1. Step 1: Identify the problem with empty strings

    Empty string outputs can cause false negatives if compared directly to expected answers.
  2. Step 2: Implement filtering before comparison

    Filtering out empty strings, while recording how many were removed, ensures that only meaningful outputs are compared and keeps the rate of empty outputs visible.
  3. Step 3: Avoid ignoring inputs or forcing None

    Ignoring inputs or replacing outputs can hide real issues or cause errors in evaluation.
  4. Final Answer:

    Filter out empty string outputs before comparing to expected answers -> Option D
  5. Quick Check:

    Filter empty outputs to avoid false errors [OK]
Hint: Filter empty outputs before evaluation to avoid false failures [OK]
Common Mistakes:
  • Ignoring inputs with empty outputs
  • Replacing empty strings with None causing errors
  • Counting empty strings as always wrong
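
One possible way to filter empty outputs without losing sight of them is sketched below. The function and variable names are hypothetical, and the approach (skip the comparison but record the input) is an assumption about how the filtering might be done, not a prescribed LangChain feature.

```python
from typing import Callable, Dict, List


def evaluate_with_empty_tracking(model: Callable[[str], str],
                                 inputs: List[str],
                                 expected: List[str]) -> Dict[str, list]:
    """Compare non-empty outputs to expected answers; record inputs that gave empty output."""
    results: List[bool] = []
    empty_inputs: List[str] = []
    for x, y in zip(inputs, expected):
        output = model(x)
        if output == "":
            # Skip the comparison, but keep a record so the gap stays visible.
            empty_inputs.append(x)
            continue
        results.append(output == y)
    return {"results": results, "empty_inputs": empty_inputs}


# A toy model that returns an empty string for one particular input.
flaky_model = lambda x: "" if x == "skip" else x.lower()

report = evaluate_with_empty_tracking(
    flaky_model, ["Hello", "skip", "World"], ["hello", "skip", "world"]
)
print(report)
```

Reporting empty_inputs alongside the comparison results lets a reviewer decide whether the empty outputs are an acceptable edge case or a model problem.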