Which of the following best explains why evaluating a Large Language Model (LLM) is crucial?
Think about what evaluation tells us about the model's output quality.
Evaluation measures how well an LLM performs on its intended tasks, confirming that it produces accurate and relevant answers. Without it, quality regressions can go unnoticed.
When evaluating an LLM's text generation, which metric is commonly used to measure how well the output matches expected results?
Look for a metric designed for comparing generated text to reference text.
The BLEU score measures n-gram overlap between generated text and one or more reference texts, making it a common choice for evaluating generated output against expected results.
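To make the idea concrete, here is a minimal sketch of clipped unigram precision, the core ingredient of BLEU. It is a simplification: full BLEU also combines higher-order n-grams and applies a brevity penalty.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: the fraction of candidate words that
    also appear in the reference, with repeats capped at the reference
    count. Full BLEU extends this to n-grams plus a brevity penalty."""
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    ref_counts = Counter(reference.split())
    # Clip each word's count by how often it occurs in the reference.
    matches = sum(min(count, ref_counts[word])
                  for word, count in Counter(cand_tokens).items())
    return matches / len(cand_tokens)

print(unigram_precision("the cat sat on the mat",
                        "the cat is on the mat"))  # 5 of 6 tokens match
```

In practice you would use an established implementation (e.g. NLTK's `sentence_bleu` or the Hugging Face `evaluate` library) rather than rolling your own.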
Given the following Python code that evaluates a simple LLM output against a reference, what is the printed accuracy?
predictions = ['hello world', 'machine learning', 'open ai']
references = ['hello world', 'machine learning', 'openai']
correct = sum(p == r for p, r in zip(predictions, references))
accuracy = correct / len(predictions)
print(f"Accuracy: {accuracy:.2f}")
Count how many predictions exactly match the references.
Two predictions match their references exactly ('hello world' and 'machine learning'); 'open ai' differs from 'openai', so accuracy is 2/3 ≈ 0.67, and the code prints "Accuracy: 0.67".
To ensure quality, which evaluation method is most suitable for detecting bias in a Large Language Model's responses?
Think about how bias can be identified beyond numeric scores.
Human reviewers can spot biased or unfair responses by testing the model with diverse prompts, which automated metrics may miss.
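A simple way to support such a review is to probe the model with paired prompts that differ only in a demographic term and compare the responses side by side. The sketch below assumes a placeholder `query_model` function standing in for a real LLM API call.

```python
def query_model(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned response here
    # so the sketch runs without any API access.
    return f"Response to: {prompt}"

# Hypothetical prompt pairs differing only in a demographic detail.
prompt_pairs = [
    ("Describe a typical nurse. He", "Describe a typical nurse. She"),
    ("The engineer from Norway said", "The engineer from Nigeria said"),
]

for prompt_a, prompt_b in prompt_pairs:
    resp_a = query_model(prompt_a)
    resp_b = query_model(prompt_b)
    # In a real audit, a human reviewer (optionally aided by sentiment
    # or toxicity scorers) compares the paired responses for
    # systematic differences in tone, competence, or content.
    print(f"PAIR:\n  A: {resp_a}\n  B: {resp_b}")
```

The key design point is that automated metrics score each response in isolation, whereas pairing prompts exposes differential treatment that only shows up in comparison.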
Consider this Python code snippet intended to calculate the average loss from a list of losses. What error does it raise?
losses = [0.25, 0.30, 0.20]
average_loss = sum(loss) / len(losses)
print(f"Average loss: {average_loss:.2f}")
Check variable names carefully for typos.
The code calls sum(loss), but the list is named 'losses'; since 'loss' is never defined, Python raises a NameError.
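The fix is a one-word change: reference the list by its actual name.

```python
losses = [0.25, 0.30, 0.20]
average_loss = sum(losses) / len(losses)  # use the defined name 'losses'
print(f"Average loss: {average_loss:.2f}")  # prints "Average loss: 0.25"
```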