For large language models (LLMs) that understand and generate text, the key metrics are perplexity and accuracy on language tasks. Perplexity measures how well the model predicts the next token in a sequence: it is the exponential of the average negative log-probability the model assigns to the tokens that actually come next. Lower perplexity means the model is less surprised by real text, which suggests it has captured language patterns better. Accuracy on tasks like question answering or text classification shows how often the model's output is correct. These metrics matter because they indicate whether the model captures language structure and meaning.
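The idea behind perplexity can be made concrete with a minimal sketch. The probability lists are made-up values rather than output from a real model, and `perplexity` is an illustrative helper name, not a library function.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities each model assigned to the token that actually came next.
confident_model = [0.5, 0.4, 0.6, 0.5]
uncertain_model = [0.05, 0.1, 0.02, 0.08]

print(f"confident model perplexity: {perplexity(confident_model):.2f}")
print(f"uncertain model perplexity: {perplexity(uncertain_model):.2f}")
```

A model that assigns higher probability to the correct tokens gets a lower perplexity. A model that always spreads its probability evenly over 4 choices would score exactly 4.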
Why LLMs understand and generate text in Prompt Engineering / GenAI - Why Metrics Matter
For text generation, a confusion matrix is less common, but for classification tasks done by LLMs, it looks like this:
                | Predicted Positive | Predicted Negative
----------------|--------------------|-------------------
Actual Positive | TP = 80            | FN = 20
Actual Negative | FP = 10            | TN = 90
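These four counts are the inputs for precision (the share of predicted positives that are truly positive) and recall (the share of actual positives the model finds), which together show how well the model separates correct from incorrect answers.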
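As a worked example, the counts from the matrix above can be turned into metrics with the standard formulas. The variable names are chosen here for illustration.

```python
# Counts taken from the confusion matrix above.
tp, fn = 80, 20   # actual positives: found vs. missed
fp, tn = 10, 90   # actual negatives: wrongly flagged vs. correctly rejected

precision = tp / (tp + fp)                  # how many predicted positives are right
recall = tp / (tp + fn)                     # how many actual positives were found
accuracy = (tp + tn) / (tp + fn + fp + tn)  # share of all predictions that are right
f1 = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}")  # 0.889
print(f"recall    = {recall:.3f}")     # 0.800
print(f"accuracy  = {accuracy:.3f}")   # 0.850
print(f"F1        = {f1:.3f}")         # 0.842
```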
When LLMs generate text, sometimes they must balance precision (being correct) and recall (covering all relevant info). For example, in a chatbot answering questions, high precision means answers are accurate and trustworthy. High recall means the model tries to cover all possible correct answers, even if some are less precise. If the model is too cautious (high precision, low recall), it may miss useful info. If it tries to say everything (high recall, low precision), it may give wrong or confusing answers.
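One way to see this trade-off is to imagine the model attaching a confidence score to each answer and only responding above a cutoff. The sketch below uses invented scores and labels, and `precision_recall` is a hypothetical helper. Treating precision as 1.0 when nothing is predicted positive is a convention assumed here.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when answers scoring >= threshold count as positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and y for p, y in zip(predicted, labels))
    fp = sum(p and not y for p, y in zip(predicted, labels))
    fn = sum((not p) and y for p, y in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention: no predictions, no mistakes
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical confidence scores and whether each answer was actually correct.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [True, True, True, False, True, False, True, False]

for threshold in (0.9, 0.5, 0.25):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

With these made-up numbers, the strict cutoff answers rarely but is always right, while the loose cutoff finds every correct answer at the cost of more mistakes.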
As a rough rule of thumb, a good LLM has low perplexity (e.g., below 20 on standard datasets) and high accuracy (above 85%) on language tasks. This means it predicts words well and generates meaningful text. A bad model has high perplexity (above 50) and low accuracy (below 60%), showing poor understanding and confusing output. These thresholds are illustrative: perplexity depends on the dataset and the tokenizer, so compare models only under the same evaluation setup. For classification tasks, good precision and recall are both above 80%. If precision is very low, the model makes many mistakes; if recall is very low, it misses important information.
One pitfall is the accuracy paradox: a model might reach high accuracy by guessing common words while failing on rare or complex language. Data leakage happens if the model sees test examples during training, which falsely inflates metrics. Overfitting means the model performs well on training data but poorly on new text, showing poor generalization. Monitoring perplexity on unseen data helps detect this, since a large gap between training and held-out perplexity is a warning sign.
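The accuracy paradox is easy to reproduce with a small sketch. The data below is invented: 95 "common" cases, 5 "rare" cases, and a baseline that always predicts the majority class.

```python
# Hypothetical imbalanced evaluation set.
labels = ["common"] * 95 + ["rare"] * 5
predictions = ["common"] * 100  # baseline that never predicts the rare class

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

rare_total = labels.count("rare")
rare_found = sum(p == y == "rare" for p, y in zip(predictions, labels))
rare_recall = rare_found / rare_total

print(f"accuracy    = {accuracy:.2f}")     # 0.95
print(f"rare recall = {rare_recall:.2f}")  # 0.00
```

The baseline looks strong on accuracy yet never handles a single rare case, which is why recall on the rare class needs to be reported alongside accuracy.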
Your LLM has 98% accuracy on training text but 12% recall on rare language tasks. Is it good for production? Why not?
Answer: No, it is not good. Accuracy on training text says little about how the model handles unseen text, and the low recall on rare tasks means the model misses many important cases even though it looks accurate on common text. This shows poor understanding of diverse language, so it may fail in real use.
Practice
Solution
Step 1: Understand how LLMs learn
LLMs learn by analyzing many examples of text to find patterns and relationships between words.
Step 2: Recognize pattern learning enables text generation
By learning these patterns, LLMs can predict and generate new text that makes sense.
Final Answer:
Because they learn patterns from large amounts of text data -> Option C
Quick Check:
Pattern learning = B [OK]
- Thinking LLMs memorize all text exactly
- Believing LLMs use fixed human rules
- Assuming LLMs convert text to images first
Solution
Step 1: Identify the text generation method
LLMs generate text by predicting the next word using the context of previous words.
Step 2: Eliminate incorrect options
Random picking ignores context, translating without patterns is wrong, and repeating only the first sentence is false.
Final Answer:
They predict the next word based on previous words -> Option B
Quick Check:
Next word prediction = D [OK]
- Thinking words are chosen randomly
- Believing LLMs do not use context
- Assuming LLMs only repeat learned sentences
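Next-word prediction from learned patterns can be illustrated with a toy bigram counter. This is a deliberately simplified stand-in: real LLMs use neural networks over tokens and much longer context, and the corpus and function names below are invented for the example.

```python
from collections import Counter, defaultdict

def train_bigrams(text):
    """Count which word follows which in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Tiny hypothetical corpus.
corpus = "i love cats . i love cats . i love dogs . cats love naps"
model = train_bigrams(corpus)

print(predict_next(model, "love"))   # cats
print(predict_next(model, "zebra"))  # None
```

The prediction depends on the preceding word and on patterns counted from the training text, not on random choice or on repeating a stored sentence.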
    context = ['I', 'love']
    next_word = 'cats'
    output = ' '.join(context + [next_word])
    print(output)

What will be printed?
Solution
Step 1: Understand the code concatenation
The code joins the list ['I', 'love'] with ['cats'] to form ['I', 'love', 'cats'].
Step 2: Join list elements into a string
Using ' '.join(...) creates the string 'I love cats'.
Final Answer:
I love cats -> Option A
Quick Check:
Joining words = C [OK]
- Mixing word order in output
- Forgetting to join all words
- Printing only part of the list
    context = ['Hello', 'world']
    next_word = 123
    output = ' '.join(context + [next_word])
    print(output)

What is the error and how to fix it?
Solution
Step 1: Identify the error type
Joining strings with an integer causes a TypeError because join expects strings.
Step 2: Fix the error by converting integer to string
Convert next_word to string using str(next_word) before joining.
Final Answer:
TypeError because next_word is int; fix by converting to string -> Option A
Quick Check:
TypeError fix = A [OK]
- Thinking it's a syntax error
- Ignoring type mismatch in join
- Assuming code runs without error
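The fix described in the solution might look like the following. Converting every item, rather than only `next_word`, is a small extra safeguard chosen for this sketch.

```python
context = ['Hello', 'world']
next_word = 123

# str.join only accepts strings, so convert each item before joining.
output = ' '.join(str(item) for item in context + [next_word])
print(output)  # Hello world 123
```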
Solution
Step 1: Understand input relevance for summarization
Providing the full article gives the LLM enough context to understand main points.
Step 2: Recognize why other options fail
Using only the first sentence, random sentences, or unrelated text lacks context, leading to poor summaries.
Final Answer:
Feed the entire article as input and ask for a summary -> Option D
Quick Check:
Full context input = A [OK]
- Using partial or random text as input
- Ignoring importance of full context
- Expecting summary from unrelated text
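A prompt that passes the whole article to the model could be assembled roughly as follows. `build_summary_prompt` and the prompt wording are hypothetical, and the sketch only builds the text. Sending it to a model depends on whichever LLM client is in use and is not shown.

```python
def build_summary_prompt(article, max_sentences=3):
    """Wrap the full article text in a summarization instruction."""
    article = article.strip()
    if not article:
        raise ValueError("article text is empty")
    return (
        f"Summarize the following article in at most {max_sentences} sentences.\n\n"
        f"Article:\n{article}\n\n"
        "Summary:"
    )

# Hypothetical article text; in practice this would be the complete article.
article = (
    "Perplexity measures how well a model predicts the next word. "
    "Lower values mean the model is less surprised by real text."
)
prompt = build_summary_prompt(article)
print(prompt)
```

The key design choice is that the full article goes into the prompt, so the model has the context it needs to pick out the main points.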
