Top-p and top-k sampling in Prompt Engineering / GenAI - Model Metrics & Evaluation

Metrics & Evaluation - Top-p and top-k sampling
Which metric matters for Top-p and Top-k sampling and WHY

For Top-p and Top-k sampling, the key metric is perplexity. Perplexity measures how well the language model predicts the next word: it is the exponential of the average negative log-probability the model assigns to each word in a text. Lower perplexity means the model assigned higher probability to that text, so its predictions are more confident.

Additionally, diversity metrics like distinct-n (the fraction of n-grams in the output that are unique) measure how varied the generated text is. Top-p and Top-k control how much randomness enters generation, so the goal is to balance perplexity against diversity.
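
Both metrics are simple to compute once you have per-word probabilities and the generated words. The sketch below is a minimal illustration, not a standard library API: the function names `perplexity` and `distinct_n` and all input values are hypothetical.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

def distinct_n(tokens, n):
    """Fraction of n-grams in `tokens` that are unique (0.0 if there are none)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Hypothetical probabilities a model assigned to each word of some text.
confident = [0.5, 0.5, 0.5, 0.5]
uncertain = [0.1, 0.1, 0.1, 0.1]
print(perplexity(confident))  # about 2
print(perplexity(uncertain))  # about 10

repetitive = "the cat the cat the cat".split()
varied = "the cat chased a small mouse".split()
print(distinct_n(repetitive, 2))  # 0.4 (2 unique bigrams out of 5)
print(distinct_n(varied, 2))      # 1.0 (all 5 bigrams are unique)
```

A perplexity of 2 can be read as the model being, on average, as uncertain as a fair choice between two words.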

Confusion matrix or equivalent visualization

Top-p and Top-k sampling do not use confusion matrices because they generate text probabilistically rather than classify fixed labels.

Instead, we visualize the probability distribution over the vocabulary at each step. For example:

    Vocabulary: ["cat", "dog", "mouse", "elephant", "lion"]
    Probabilities: [0.4, 0.3, 0.15, 0.1, 0.05]

    - Top-k=3 selects top 3 words: "cat", "dog", "mouse"
    - Top-p=0.7 selects words until cumulative prob >= 0.7: "cat" (0.4) + "dog" (0.3) = 0.7
    

This shows how sampling narrows choices to control randomness.
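
The same example can be reproduced in a few lines. This is a sketch under stated assumptions: `top_k_candidates` and `top_p_candidates` are hypothetical helper names, and the small tolerance `tol` is an added safeguard against floating-point rounding at the exact threshold, not part of the definition.

```python
def top_k_candidates(probs, k):
    """Return the k most probable words, most probable first."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def top_p_candidates(probs, p, tol=1e-9):
    """Return the smallest ranked prefix whose total probability reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, total = [], 0.0
    for word in ranked:
        kept.append(word)
        total += probs[word]
        if total >= p - tol:  # tolerance guards against float rounding
            break
    return kept

probs = {"cat": 0.4, "dog": 0.3, "mouse": 0.15, "elephant": 0.1, "lion": 0.05}
print(top_k_candidates(probs, 3))    # ['cat', 'dog', 'mouse']
print(top_p_candidates(probs, 0.7))  # ['cat', 'dog']
```

Top-k always keeps the same number of words, while the size of the top-p set depends on how peaked the distribution is.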

Precision vs Recall tradeoff with concrete examples

Top-p and Top-k sampling balance quality and diversity in generated text.

  • Top-k too low: Only a few words considered, text is repetitive and safe but boring (low diversity).
  • Top-k too high: Many words considered, text is diverse but may be nonsensical (low quality).
  • Top-p too low: Only very probable words chosen, text is predictable but dull.
  • Top-p too high: Includes rare words, text is creative but can be confusing.

Choosing the right threshold depends on whether you want safer or more creative outputs.
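
One rough way to see this tradeoff numerically is to measure how much randomness each setting leaves in the distribution. The sketch below uses the entropy of the truncated, renormalized distribution as a proxy; that choice of proxy and the name `truncated_entropy` are assumptions made for illustration, not metrics named above.

```python
import math

def truncated_entropy(sorted_probs, k):
    """Entropy in bits of the top-k probabilities after renormalizing them."""
    kept = sorted_probs[:k]
    total = sum(kept)
    return sum((q / total) * math.log2(total / q) for q in kept)

# Probabilities sorted from most to least likely (same example as above).
probs = [0.4, 0.3, 0.15, 0.1, 0.05]
for k in range(1, len(probs) + 1):
    print(k, round(truncated_entropy(probs, k), 3))
```

With k=1 the entropy is 0 (no randomness at all), and it grows as k admits more words, which mirrors the move from safe and repetitive toward diverse and risky.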

What "good" vs "bad" metric values look like for Top-p and Top-k sampling

Good values:

  • Perplexity (of the generated text under the model): Moderate (not too low or high), indicating confident but flexible predictions.
  • Diversity (distinct-n): Balanced, showing varied but coherent text.
  • Human evaluation: Text is fluent, relevant, and interesting.

Bad values:

  • Perplexity too low: Text is repetitive and dull.
  • Perplexity too high: Text is random and nonsensical.
  • Diversity too low: Same phrases repeated.
  • Diversity too high: Text loses meaning.
Metrics pitfalls
  • Ignoring diversity: Only measuring perplexity can miss dull repetitive text.
  • Overfitting: Model memorizes training text, so perplexity looks low on familiar text while outputs show poor creativity.
  • Data leakage: If test prompts appear in training, metrics are misleadingly good.
  • Misinterpreting sampling parameters: Confusing top-p and top-k effects can lead to wrong tuning.
Self-check question

Your language model uses top-k sampling with k=5 and shows low perplexity but very repetitive text. Is this good? Why or why not?

Answer: No, it is not good. Low perplexity means the model is confident, but repetitive text shows low diversity. The top-k value might be too low, limiting creativity. You should increase k or adjust top-p to balance quality and diversity.

Key Result
Top-p and top-k sampling balance prediction confidence (perplexity) and text diversity to produce fluent yet creative language outputs.

Practice

1. What does top-k sampling do in text generation?
easy
A. It selects the next word from the top k most likely words.
B. It selects the next word randomly from all possible words.
C. It picks words until their total probability reaches p.
D. It always picks the single most likely next word.

Solution

  1. Step 1: Understand top-k sampling definition

    Top-k sampling limits choices to the top k words with highest probabilities.
  2. Step 2: Compare with other methods

    Random selection from all possible words and picking words until total probability reaches p describe other methods; always picking the single most likely next word is greedy decoding, not sampling.
  3. Final Answer:

    It selects the next word from the top k most likely words. -> Option A
  4. Quick Check:

    Top-k = top k words [OK]
Hint: Top-k means pick from top k words only [OK]
Common Mistakes:
  • Confusing top-k with top-p sampling
  • Thinking top-k picks only one word always
  • Mixing top-k with greedy decoding
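
To make the difference from greedy decoding concrete, here is a minimal sketch of a single top-k draw using only the standard library. The function name `sample_top_k` and the example probabilities are hypothetical.

```python
import random

def sample_top_k(probs, k, rng=random):
    """Draw one word from the k most probable words, weighted by probability."""
    ranked = sorted(probs, key=probs.get, reverse=True)[:k]
    weights = [probs[w] for w in ranked]
    # random.choices accepts unnormalized weights, so the kept
    # probabilities do not need to be rescaled to sum to 1 first.
    return rng.choices(ranked, weights=weights, k=1)[0]

probs = {"cat": 0.4, "dog": 0.3, "mouse": 0.15, "elephant": 0.1, "lion": 0.05}
rng = random.Random(0)  # seeded only so the sketch is repeatable

draws = [sample_top_k(probs, 3, rng) for _ in range(10)]
print(draws)  # only "cat", "dog", or "mouse" can appear

greedy_like = {sample_top_k(probs, 1, rng) for _ in range(10)}
print(greedy_like)  # {'cat'}: with k=1 the draw is always the top word
```

The second call shows why k=1 is equivalent to greedy decoding: only one candidate survives, so there is nothing left to sample.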
2. Which of the following correctly describes how top-p sampling selects candidate words?
easy
A. Select words until their cumulative probability reaches or exceeds p.
B. Select exactly p words with highest probabilities.
C. Select the single word with probability p.
D. Select words randomly ignoring probabilities.

Solution

  1. Step 1: Recall top-p sampling definition

    Top-p sampling chooses the smallest set of words, taken in order of decreasing probability, whose total probability is at least p.
  2. Step 2: Evaluate options

    Selecting words until their cumulative probability reaches or exceeds p matches this definition. Selecting exactly p words confuses top-p with top-k. Random selection ignoring probabilities and selecting a single word with probability p are incorrect.
  3. Final Answer:

    Select words until their cumulative probability reaches or exceeds p. -> Option A
  4. Quick Check:

    Top-p = cumulative probability ≥ p [OK]
Hint: Top-p sums probabilities to reach p [OK]
Common Mistakes:
  • Confusing number of words with cumulative probability
  • Thinking top-p picks fixed number of words
  • Ignoring word probabilities in selection
3. Given these word probabilities sorted descending: {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}, what words are included in top-p sampling with p=0.7?
medium
A. ['a']
B. ['a', 'b', 'c']
C. ['a', 'b']
D. ['a', 'b', 'c', 'd']

Solution

  1. Step 1: Calculate cumulative probabilities

    Sum probabilities in order: 'a' = 0.4, 'a'+'b' = 0.7, 'a'+'b'+'c' = 0.9.
  2. Step 2: Select smallest set ≥ p=0.7

    The smallest set with sum ≥ 0.7 is ['a', 'b'].
  3. Final Answer:

    ['a', 'b'] -> Option C
  4. Quick Check:

    Cumulative sum ≥ 0.7 includes 'a' and 'b' [OK]
Hint: Sum probabilities until ≥ p [OK]
Common Mistakes:
  • Including too many words beyond p
  • Stopping before reaching p
  • Confusing top-p with top-k count
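
After the set is chosen, the kept probabilities are usually rescaled to sum to 1 before a word is drawn. The sketch below applies this to the probabilities from question 3. The name `nucleus` is hypothetical, and the tolerance `tol` is an added guard against floating-point rounding at the exact threshold.

```python
def nucleus(probs, p, tol=1e-9):
    """Keep the smallest high-probability prefix reaching p, then renormalize."""
    kept, total = {}, 0.0
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    for word, prob in ranked:
        kept[word] = prob
        total += prob
        if total >= p - tol:
            break
    # Rescale so the surviving probabilities sum to 1.
    return {word: prob / total for word, prob in kept.items()}

probs = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
print(nucleus(probs, 0.7))   # keeps 'a' and 'b', weights near 0.571 and 0.429
print(nucleus(probs, 0.71))  # a slightly larger p pulls in 'c' as well
```

The second call illustrates how sensitive the set size can be near a boundary: raising p from 0.7 to 0.71 grows the set from two words to three.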
4. You wrote code for top-k sampling but it always picks only one word. What is the likely bug?
medium
A. You summed probabilities instead of sorting words.
B. You set k=1 instead of a larger number.
C. You used top-p sampling code instead of top-k.
D. You forgot to normalize probabilities.

Solution

  1. Step 1: Understand top-k parameter effect

    Setting k=1 means only the single most likely word is chosen.
  2. Step 2: Check other options

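    Summing instead of sorting, or forgetting to normalize, distorts the probabilities but does not by itself restrict the candidate set to one word. Using top-p code by mistake would only do so when p is very small. Setting k=1 is the most direct explanation.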
  3. Final Answer:

    You set k=1 instead of a larger number. -> Option B
  4. Quick Check:

    k=1 picks only one word [OK]
Hint: Check if k=1 limits output to one word [OK]
Common Mistakes:
  • Confusing top-k and top-p parameters
  • Ignoring parameter values in code
  • Assuming normalization fixes count
5. You want to generate text that balances creativity and coherence. Which approach is best?
hard
A. Use random sampling ignoring probabilities.
B. Use greedy decoding to always pick the most likely word.
C. Use top-k sampling with k=1 only.
D. Use top-k sampling with a moderate k and top-p sampling with p around 0.9 together.

Solution

  1. Step 1: Understand creativity vs coherence tradeoff

    Greedy decoding is too rigid; random sampling is too chaotic; top-k with k=1 is greedy.
  2. Step 2: Combine top-k and top-p for balance

    Using moderate k and p near 0.9 limits choices to plausible words but allows variety, improving naturalness.
  3. Final Answer:

    Use top-k sampling with a moderate k and top-p sampling with p around 0.9 together. -> Option D
  4. Quick Check:

    Combining top-k and top-p balances randomness and coherence [OK]
Hint: Combine top-k and top-p for best text quality [OK]
Common Mistakes:
  • Choosing greedy decoding for creativity
  • Ignoring probability thresholds
  • Using too small k or p values
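
A combined filter can be sketched as follows. This assumes top-k is applied first and top-p is then applied to the renormalized survivors; libraries may differ in ordering and renormalization, so treat that as an assumption of this sketch. The function names are hypothetical, and the toy vocabulary is far smaller than a real model's.

```python
import random

def filter_top_k_top_p(probs, k, p, tol=1e-9):
    """Keep the top-k words, then the smallest prefix of them reaching p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    k_total = sum(prob for _, prob in ranked)
    kept, cumulative = {}, 0.0
    for word, prob in ranked:
        kept[word] = prob
        cumulative += prob / k_total  # renormalize over the top-k survivors
        if cumulative >= p - tol:
            break
    return kept

def sample_top_k_top_p(probs, k, p, rng=random):
    """Draw one word from the candidates that survive both filters."""
    kept = filter_top_k_top_p(probs, k, p)
    return rng.choices(list(kept), weights=list(kept.values()), k=1)[0]

probs = {"cat": 0.4, "dog": 0.3, "mouse": 0.15, "elephant": 0.1, "lion": 0.05}
print(list(filter_top_k_top_p(probs, k=2, p=0.8)))  # ['cat', 'dog']: k is the binding limit
print(list(filter_top_k_top_p(probs, k=5, p=0.8)))  # ['cat', 'dog', 'mouse']: p is the binding limit
print(sample_top_k_top_p(probs, k=5, p=0.8, rng=random.Random(0)))
```

Whichever filter is stricter for a given distribution determines the final candidate set, which is why the two are often used together: k caps the set size on flat distributions, and p trims the tail on peaked ones.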