Top-p and top-k sampling in Prompt Engineering / GenAI - Deep Dive

Overview - Top-p and top-k sampling
What is it?
Top-p and top-k sampling are methods used to pick the next word or token when a language model generates text. Instead of always choosing the most likely word, these methods add randomness by selecting from a smaller set of probable words. Top-k sampling picks from the top k most likely words, while top-p sampling picks from the smallest group of words whose combined probability is at least p. This helps make generated text more diverse and natural.
Why it matters
Without these sampling methods, language models would often produce repetitive or boring text by always choosing the most likely word. This would make conversations or stories feel unnatural and robotic. Top-p and top-k sampling allow models to balance between making sensible choices and adding creativity, making AI-generated text more engaging and useful in real life.
Where it fits
Before learning top-p and top-k sampling, you should understand how language models predict the next word using probabilities. After this, you can explore related decoding techniques like temperature scaling and beam search, and then move on to fine-tuning models for specific tasks.
Mental Model
Core Idea
Top-p and top-k sampling pick the next word from a smaller, more likely group of words to balance sensible choices against creative ones.
Think of it like...
Imagine you are at an ice cream shop with many flavors. Instead of always picking the most popular flavor, you choose from the top few popular flavors (top-k) or from flavors that together make up most of the customers' choices (top-p). This way, you get variety but still enjoy popular tastes.
Probability distribution of next words:

Words sorted by probability:
┌─────────────┬───────────────┐
│ Word        │ Probability   │
├─────────────┼───────────────┤
│ the         │ 0.30          │
│ a           │ 0.20          │
│ cat         │ 0.15          │
│ dog         │ 0.10          │
│ runs        │ 0.08          │
│ jumps       │ 0.07          │
│ quickly     │ 0.05          │
│ slowly      │ 0.05          │
└─────────────┴───────────────┘

Top-k=3 picks from {the, a, cat}
Top-p=0.7 picks from {the, a, cat, dog}: 0.30 + 0.20 + 0.15 = 0.65 is still below 0.7, so the next word, dog (0.10), is added to reach 0.75 ≥ 0.7
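The two candidate sets above can be reproduced in a few lines of Python. This is only a minimal sketch: the helper names `top_k_candidates` and `top_p_candidates` are made up for illustration and do not come from any library.

```python
# Probabilities copied from the table above (they sum to 1).
probs = {
    "the": 0.30, "a": 0.20, "cat": 0.15, "dog": 0.10,
    "runs": 0.08, "jumps": 0.07, "quickly": 0.05, "slowly": 0.05,
}

def top_k_candidates(probs, k):
    """Return the k most probable words, most likely first."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def top_p_candidates(probs, p):
    """Return the shortest ranked prefix whose total probability is >= p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    chosen, total = [], 0.0
    for word in ranked:
        chosen.append(word)
        total += probs[word]
        if total >= p:
            break
    return chosen

print(top_k_candidates(probs, 3))    # ['the', 'a', 'cat']
print(top_p_candidates(probs, 0.7))  # ['the', 'a', 'cat', 'dog']
```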
Build-Up - 7 Steps
1
Foundation: Understanding Language Model Predictions
🤔
Concept: Language models predict the next word by assigning probabilities to all possible words.
A language model looks at the words so far and calculates how likely each possible next word is. For example, after 'The cat', it might say 'sat' has 0.4 chance, 'runs' 0.3, 'jumps' 0.2, and others share the rest.
Result
You get a list of words with probabilities that sum to 1, showing how likely each word is to come next.
Understanding that language models produce a probability distribution is key to knowing how sampling methods decide the next word.
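In most models these probabilities come from applying a softmax to raw scores (logits); that detail is not spelled out in this step, so treat the sketch below as background. The logits are invented numbers for the word after 'The cat', and `softmax` is a hand-written helper rather than a library call.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - peak) for word, score in logits.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

# Hypothetical scores for the word after "The cat"; the numbers are made up.
logits = {"sat": 2.0, "runs": 1.7, "jumps": 1.3, "flies": 0.1}
probs = softmax(logits)

# Print the words from most to least likely.
for word, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:6s} {p:.2f}")
print("total:", round(sum(probs.values()), 6))
```

Higher scores map to higher probabilities, and the ranking of the words is preserved.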
2
Foundation: Why Randomness Helps Text Generation
🤔
Concept: Choosing the highest probability word every time makes text boring and repetitive.
If the model always picks the most likely word, sentences become predictable and dull. For example, a model that always takes its top choice tends to fall back on the same common words and can get stuck repeating a phrase. Sampling from the probable words instead creates more interesting and varied text.
Result
Text becomes more natural and less robotic when randomness is introduced.
Knowing why randomness is needed helps appreciate why sampling methods like top-p and top-k exist.
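The contrast between always taking the top word and sampling can be seen in a small experiment. This sketch reuses the 'The cat' probabilities from step 1; the `<other>` entry is a placeholder for all remaining words, and `greedy` and `sample` are illustrative helpers.

```python
import random

# Probabilities from the 'The cat' example; "<other>" stands in for the rest.
probs = {"sat": 0.4, "runs": 0.3, "jumps": 0.2, "<other>": 0.1}

def greedy(probs):
    """Always return the single most likely word."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw one word at random, weighted by its probability."""
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so repeated runs give the same draws
greedy_picks = {greedy(probs) for _ in range(200)}
sampled_picks = {sample(probs, rng) for _ in range(200)}

print(greedy_picks)   # one word only
print(sampled_picks)  # several different words
```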
3
Intermediate: How Top-k Sampling Works
🤔Before reading on: do you think top-k sampling picks words only from the top k words or from all words but favors the top k? Commit to your answer.
Concept: Top-k sampling limits the choice to the k most probable words and picks randomly among them.
After the model predicts probabilities, top-k sampling sorts words by probability and keeps only the top k words. It then normalizes their probabilities to sum to 1 and randomly picks one. For example, with k=3, only the top 3 words are considered.
Result
The next word is chosen from a smaller set, adding randomness but keeping choices sensible.
Understanding top-k sampling shows how limiting choices to a fixed number controls randomness and diversity.
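The sort, truncate, renormalize, and pick sequence described in this step might look like the following. It is a minimal sketch using only the standard library; `top_k_sample` is a hypothetical helper, and the distribution is the 'The cat' example with `<other>` standing in for the remaining words.

```python
import random

def top_k_sample(probs, k, rng=random):
    """Keep the k most probable words, renormalize, and draw one."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    words = [w for w, _ in ranked]
    weights = [p / total for _, p in ranked]  # renormalized to sum to 1
    return rng.choices(words, weights=weights, k=1)[0]

probs = {"sat": 0.4, "runs": 0.3, "jumps": 0.2, "<other>": 0.1}
rng = random.Random(42)  # fixed seed for repeatable draws
draws = [top_k_sample(probs, k=3, rng=rng) for _ in range(1000)]

# Only words from the top 3 can ever appear; "<other>" is excluded.
print(sorted(set(draws)))
```

With k=1 the function always returns the most likely word, which is the same as greedy decoding.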
4
Intermediate: How Top-p (Nucleus) Sampling Works
🤔Before reading on: do you think top-p sampling always picks a fixed number of words or a variable number based on probabilities? Commit to your answer.
Concept: Top-p sampling picks from the smallest set of words whose combined probability is at least p.
Words are sorted by probability. Starting from the top, words are added until their total probability reaches or exceeds p (like 0.9). This set can vary in size. Then one word is randomly chosen from this set after normalizing probabilities.
Result
The next word is chosen from a dynamic set that adapts to the shape of the probability distribution.
Knowing top-p sampling adapts the candidate set size helps understand its flexibility compared to top-k.
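A sketch of the same idea in code, assuming the cutoff rule "stop as soon as the running total is at least p" stated in this step. The helper names `nucleus` and `top_p_sample` are illustrative, and the distribution is again the 'The cat' example with an `<other>` placeholder.

```python
import random

def nucleus(probs, p):
    """Smallest set of top-ranked words whose total probability is at least p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, total = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        total += prob
        if total >= p:
            break
    return kept

def top_p_sample(probs, p, rng=random):
    """Sample one word from the nucleus after renormalizing."""
    kept = nucleus(probs, p)
    total = sum(prob for _, prob in kept)
    words = [w for w, _ in kept]
    weights = [prob / total for _, prob in kept]
    return rng.choices(words, weights=weights, k=1)[0]

probs = {"sat": 0.4, "runs": 0.3, "jumps": 0.2, "<other>": 0.1}
print([w for w, _ in nucleus(probs, 0.5)])   # ['sat', 'runs']
print([w for w, _ in nucleus(probs, 0.85)])  # ['sat', 'runs', 'jumps']
print(top_p_sample(probs, 0.85, random.Random(1)))
```

The candidate set grows from two words to three as p rises, without any fixed count being specified.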
5
Intermediate: Comparing Top-k and Top-p Sampling
🤔Before reading on: which method do you think adapts better to different probability shapes, top-k or top-p? Commit to your answer.
Concept: Top-k uses a fixed number of words, while top-p uses a variable number based on cumulative probability.
Top-k always picks from the same number of words, which can be too small or too large depending on the distribution. Top-p adjusts the number of words to cover a probability mass, making it more flexible. For example, if probabilities are spread out, top-p might pick more words than top-k.
Result
Top-p often produces more natural and diverse text by adapting to the model's confidence.
Understanding the difference helps choose the right sampling method for different tasks.
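To make the difference concrete, the sketch below counts how many candidates each method keeps for two invented distributions, one where the model is confident and one where it is not. The distributions and the helper names are assumptions chosen for illustration.

```python
def top_k_size(probs, k):
    """Number of candidates top-k keeps (never more than the vocabulary)."""
    return min(k, len(probs))

def top_p_size(probs, p):
    """Number of candidates needed for cumulative probability >= p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

# Two made-up distributions: one confident, one uncertain.
peaked = [0.85, 0.05, 0.05, 0.05]
flat = [0.125] * 8

for name, dist in [("peaked", peaked), ("flat", flat)]:
    print(name, "top-k=3 keeps", top_k_size(dist, 3),
          "| top-p=0.8 keeps", top_p_size(dist, 0.8))
# peaked top-k=3 keeps 3 | top-p=0.8 keeps 1
# flat top-k=3 keeps 3 | top-p=0.8 keeps 7
```

Top-k keeps three candidates in both cases, while top-p shrinks to one word when the model is confident and widens to seven when it is not.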
6
Advanced: Balancing Creativity and Coherence with Sampling
🤔Before reading on: do you think increasing k or p always improves text quality? Commit to your answer.
Concept: Adjusting k or p controls the trade-off between safe, predictable text and creative, diverse text.
Higher k or p means more words to choose from, increasing creativity but risking nonsense. Lower values make text safer but repetitive. Finding the right balance depends on the task, like storytelling or factual answers.
Result
Proper tuning of sampling parameters leads to better text quality for the intended use.
Knowing how sampling parameters affect output quality is crucial for practical applications.
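One rough way to see this trade-off is to measure how much randomness remains after truncation. The sketch below uses Shannon entropy as a proxy for randomness, which is an assumption of this example rather than something the text prescribes, and applies it to the word table from the Mental Model section.

```python
import math

def truncated_entropy(probs, k):
    """Entropy (in bits) of the renormalized top-k distribution."""
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    renormalized = [p / total for p in top]
    return sum(q * math.log2(1 / q) for q in renormalized)

# Probabilities from the word table in the Mental Model section.
probs = [0.30, 0.20, 0.15, 0.10, 0.08, 0.07, 0.05, 0.05]

for k in (1, 3, 8):
    print(f"k={k}: {truncated_entropy(probs, k):.2f} bits")
```

At k=1 the entropy is zero because there is nothing left to choose between; it rises as k grows, which matches the intuition that larger k means more varied but less predictable output.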
7
Expert: Surprising Effects of Sampling on Model Biases
🤔Before reading on: do you think sampling methods can affect the biases in generated text? Commit to your answer.
Concept: Sampling methods influence which biases in the model become more or less visible in generated text.
Because top-k and top-p limit choices, they can amplify or reduce certain biases. For example, rare but biased words might be excluded or included depending on parameters. Also, sampling can affect repetition and factual accuracy in subtle ways.
Result
Sampling choices impact not just creativity but also fairness and reliability of AI outputs.
Understanding sampling's role in bias helps experts design safer and more trustworthy AI systems.
Under the Hood
Language models output a probability distribution over all possible next tokens. Top-k sampling sorts these tokens by probability and truncates the list to the top k tokens, then samples from this truncated list after renormalizing probabilities. Top-p sampling sorts tokens and includes them cumulatively until their total probability reaches or exceeds p, then samples from this dynamic set after renormalizing. Both methods rely on sorting and renormalizing probabilities before random selection.
Why designed this way?
These methods were created to avoid the pitfalls of always picking the highest probability token, which leads to dull text, and to improve over naive random sampling that can pick unlikely words. Top-k was simpler but fixed in size, while top-p was introduced to adapt to the model's confidence dynamically, improving text quality and diversity.
Model output probabilities
          ↓
  ┌─────────────────────┐
  │ Sort tokens by prob │
  └─────────┬───────────┘
            │
    ┌───────┴────────┐
    │                │
Top-k sampling   Top-p sampling
    │                │
Keep top k tokens  Keep tokens until cumulative prob ≥ p
    │                │
Renormalize probs  Renormalize probs
    │                │
Randomly sample one token
            ↓
      Next word chosen
Myth Busters - 4 Common Misconceptions
Quick: Does top-k sampling always pick exactly k words to sample from? Commit yes or no.
Common Belief: Top-k sampling always picks exactly k words to sample from, no more, no less.
Reality: Top-k sampling keeps at most k words by design. If the vocabulary has fewer than k entries, fewer are available, and some implementations drop zero-probability tokens or keep every token tied at the cutoff, so the candidate count is not always exactly k.
Why it matters: Assuming exactly k words are always sampled can lead to misunderstanding model behavior and tuning errors.
Quick: Does top-p sampling always pick the same number of words for every prediction? Commit yes or no.
Common Belief: Top-p sampling picks a fixed number of words like top-k, just based on a probability threshold.
Reality: Top-p sampling picks a variable number of words depending on the shape of the probability distribution, which can change every prediction.
Why it matters: Misunderstanding this can cause confusion when tuning parameters or debugging generation results.
Quick: Does increasing top-k or top-p always make generated text better? Commit yes or no.
Common Belief: Increasing top-k or top-p always improves text quality by adding more choices.
Reality: Too high values can cause the model to pick unlikely or nonsensical words, reducing coherence and quality.
Why it matters: Blindly increasing parameters can degrade output, wasting resources and causing poor user experience.
Quick: Can sampling methods affect the biases present in model outputs? Commit yes or no.
Common Belief: Sampling methods only affect randomness and diversity, not biases in the model.
Reality: Sampling can amplify or reduce biases by changing which tokens are likely to be chosen, affecting fairness and safety.
Why it matters: Ignoring this can lead to unexpected biased or harmful outputs in production systems.
Expert Zone
1
Top-p sampling can dynamically adjust to the model's confidence, sometimes selecting very few tokens when the model is sure, and many when uncertain.
2
Combining temperature scaling with top-p or top-k sampling can finely control randomness and output diversity.
3
Sampling methods interact with tokenization granularity; subword tokens can affect how sampling choices translate to meaningful words.
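The second point, combining temperature with top-p, can be illustrated by checking how the nucleus size changes as temperature changes. This is a sketch under stated assumptions: temperature is applied by dividing the logits before the softmax, the logits are made-up numbers, and both helpers are hand-written for this example.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nucleus_size(probs, p):
    """How many top-ranked tokens are needed to reach cumulative probability p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    return len(probs)

logits = [4.0, 3.0, 2.0, 1.0, 0.0]  # made-up scores for five tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: top-p=0.9 keeps {nucleus_size(probs, 0.9)} tokens")
```

A low temperature sharpens the distribution so the same p covers fewer tokens, while a high temperature flattens it so more tokens enter the nucleus.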
When NOT to use
Top-p and top-k sampling are not ideal when deterministic or highly accurate outputs are needed, such as in legal or medical text generation. In such cases, beam search or greedy decoding is preferred for consistency and precision.
Production Patterns
In real-world systems, top-p sampling with p around 0.9 is common for chatbots to balance creativity and coherence. Developers often combine sampling with repetition penalties and temperature tuning. Monitoring output diversity and bias is standard practice to maintain quality.
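When both limits are used together, one possible arrangement is to apply top-k first and then top-p over the renormalized survivors. The order is an assumption of this sketch, since libraries and services may combine the two differently, and `filter_top_k_then_top_p` is a hypothetical helper. The value k=5 is arbitrary; p=0.9 follows the figure mentioned above.

```python
def filter_top_k_then_top_p(probs, k, p):
    """Apply top-k, then top-p within that set; return renormalized survivors."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    k_total = sum(prob for _, prob in ranked)
    kept, cumulative = [], 0.0
    for word, prob in ranked:
        kept.append((word, prob))
        cumulative += prob / k_total  # cumulative mass within the top-k set
        if cumulative >= p:
            break
    final_total = sum(prob for _, prob in kept)
    return {word: prob / final_total for word, prob in kept}

# Probabilities from the word table in the Mental Model section.
probs = {
    "the": 0.30, "a": 0.20, "cat": 0.15, "dog": 0.10,
    "runs": 0.08, "jumps": 0.07, "quickly": 0.05, "slowly": 0.05,
}

candidates = filter_top_k_then_top_p(probs, k=5, p=0.9)
print(list(candidates))  # ['the', 'a', 'cat', 'dog']
```

The returned dictionary is already renormalized, so it can be passed straight to a weighted random choice.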
Connections
Temperature Scaling
Builds-on
Temperature scaling changes the shape of the probability distribution before applying top-p or top-k sampling, allowing finer control over randomness and creativity.
Beam Search
Opposite approach
Beam search focuses on finding the most likely sequences deterministically, contrasting with the randomness of top-p and top-k sampling, highlighting different trade-offs between diversity and accuracy.
Decision Making Under Uncertainty (Psychology)
Similar pattern
Top-p sampling's method of choosing from a cumulative probability threshold mirrors how humans consider options until they feel confident enough to decide, linking AI sampling to human cognitive strategies.
Common Pitfalls
#1 Setting top-k too high, causing nonsensical outputs.
Wrong approach:
    top_k = 1000
    next_word = sample_from_top_k(probabilities, top_k)
Correct approach:
    top_k = 50
    next_word = sample_from_top_k(probabilities, top_k)
Root cause: Misunderstanding that a very large k defeats the purpose of limiting choices and increases the chance of picking unlikely words.
#2 Confusing top-p with a fixed number of tokens to sample from.
Wrong approach:
    top_p = 0.9
    # Assuming top-p always picks exactly 10 tokens
    candidates = get_top_p_tokens(probabilities, top_p)
    assert len(candidates) == 10
Correct approach:
    top_p = 0.9
    candidates = get_top_p_tokens(probabilities, top_p)
    # len(candidates) varies with the cumulative probability
Root cause: Not realizing top-p sampling adapts candidate set size dynamically.
#3 Using top-k or top-p sampling without renormalizing probabilities.
Wrong approach:
    candidates = get_top_k_tokens(probabilities, k)
    next_word = random_choice(candidates, original_probabilities)
Correct approach:
    candidates = get_top_k_tokens(probabilities, k)
    renormalized_probs = normalize(candidates.probabilities)
    next_word = random_choice(candidates, renormalized_probs)
Root cause: Forgetting that probabilities must sum to 1 after truncation to sample correctly.
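The effect of skipping renormalization shows up clearly with a sampler that expects a proper distribution. The sketch below uses a hand-written inverse-CDF picker, `pick_by_cdf`, which is an assumption of this example and not how any particular library samples; the draw `u` is fixed at 0.9 purely to expose the problem.

```python
def pick_by_cdf(candidates, u):
    """Inverse-CDF sampling: walk the cumulative sum until it passes u in [0, 1)."""
    cumulative = 0.0
    for word, prob in candidates:
        cumulative += prob
        if u < cumulative:
            return word
    return None  # ran off the end: the probabilities never reached u

def normalize(candidates):
    """Rescale probabilities so they sum to 1."""
    total = sum(prob for _, prob in candidates)
    return [(word, prob / total) for word, prob in candidates]

# Top 3 words from the Mental Model table; their combined mass is only 0.65.
truncated = [("the", 0.30), ("a", 0.20), ("cat", 0.15)]

u = 0.9  # a fixed "random" draw, chosen to expose the problem
print(pick_by_cdf(truncated, u))             # None
print(pick_by_cdf(normalize(truncated), u))  # cat
```

Python's `random.choices` treats its weights as relative and rescales them internally, so it would hide this bug; samplers that require probabilities summing to 1 would not.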
Key Takeaways
Top-k and top-p sampling are techniques to add controlled randomness when choosing the next word in text generation.
Top-k picks from a fixed number of most likely words, while top-p picks from a variable set covering a probability threshold.
These methods help balance between safe, repetitive text and creative, diverse outputs.
Choosing the right parameters is crucial to avoid nonsensical or biased text.
Sampling methods influence not only creativity but also fairness and reliability of AI-generated content.

Practice

(1/5)
1. What does top-k sampling do in text generation?
easy
A. It selects the next word from the top k most likely words.
B. It selects the next word randomly from all possible words.
C. It picks words until their total probability reaches p.
D. It always picks the single most likely next word.

Solution

  1. Step 1: Understand top-k sampling definition

    Top-k sampling limits choices to the top k words with highest probabilities.
  2. Step 2: Compare with other methods

    Random selection from all possible words and picking words until total probability reaches p describe other methods; always picking the single most likely next word is greedy decoding, not sampling.
  3. Final Answer:

    It selects the next word from the top k most likely words. -> Option A
  4. Quick Check:

    Top-k = top k words [OK]
Hint: Top-k means pick from top k words only [OK]
Common Mistakes:
  • Confusing top-k with top-p sampling
  • Thinking top-k picks only one word always
  • Mixing top-k with greedy decoding
2. Which of the following is the correct way to apply top-p sampling in code?
easy
A. Select words until their cumulative probability reaches or exceeds p.
B. Select exactly p words with highest probabilities.
C. Select the single word with probability p.
D. Select words randomly ignoring probabilities.

Solution

  1. Step 1: Recall top-p sampling definition

    Top-p sampling chooses the smallest set of words whose total probability is at least p.
  2. Step 2: Evaluate options

    Selecting words until their cumulative probability reaches or exceeds p matches this definition. Selecting exactly p words confuses top-p with top-k. Random selection ignoring probabilities and selecting a single word with probability p are incorrect.
  3. Final Answer:

    Select words until their cumulative probability reaches or exceeds p. -> Option A
  4. Quick Check:

    Top-p = cumulative probability ≥ p [OK]
Hint: Top-p sums probabilities to reach p [OK]
Common Mistakes:
  • Confusing number of words with cumulative probability
  • Thinking top-p picks fixed number of words
  • Ignoring word probabilities in selection
3. Given these word probabilities sorted descending: {'a': 0.4, 'b': 0.3, 'c': 0.2, 'd': 0.1}, what words are included in top-p sampling with p=0.7?
medium
A. ['a']
B. ['a', 'b', 'c']
C. ['a', 'b']
D. ['a', 'b', 'c', 'd']

Solution

  1. Step 1: Calculate cumulative probabilities

    Sum probabilities in order: 'a' = 0.4, 'a'+'b' = 0.7, 'a'+'b'+'c' = 0.9.
  2. Step 2: Select smallest set ≥ p=0.7

    The smallest set with sum ≥ 0.7 is ['a', 'b'].
  3. Final Answer:

    ['a', 'b'] -> Option C
  4. Quick Check:

    Cumulative sum ≥ 0.7 includes 'a' and 'b' [OK]
Hint: Sum probabilities until ≥ p [OK]
Common Mistakes:
  • Including too many words beyond p
  • Stopping before reaching p
  • Confusing top-p with top-k count
4. You wrote code for top-k sampling but it always picks only one word. What is the likely bug?
medium
A. You summed probabilities instead of sorting words.
B. You set k=1 instead of a larger number.
C. You used top-p sampling code instead of top-k.
D. You forgot to normalize probabilities.

Solution

  1. Step 1: Understand top-k parameter effect

    Setting k=1 means only the single most likely word is chosen.
  2. Step 2: Check other options

    Summing probabilities instead of sorting, or mixing up the two methods, would not reliably leave exactly one word every time; normalization changes the probabilities but not how many words are kept.
  3. Final Answer:

    You set k=1 instead of a larger number. -> Option B
  4. Quick Check:

    k=1 picks only one word [OK]
Hint: Check if k=1 limits output to one word [OK]
Common Mistakes:
  • Confusing top-k and top-p parameters
  • Ignoring parameter values in code
  • Assuming normalization fixes count
5. You want to generate text that balances creativity and coherence. Which approach is best?
hard
A. Use random sampling ignoring probabilities.
B. Use greedy decoding to always pick the most likely word.
C. Use top-k sampling with k=1 only.
D. Use top-k sampling with a moderate k and top-p sampling with p around 0.9 together.

Solution

  1. Step 1: Understand creativity vs coherence tradeoff

    Greedy decoding is too rigid; random sampling is too chaotic; top-k with k=1 is greedy.
  2. Step 2: Combine top-k and top-p for balance

    Using moderate k and p near 0.9 limits choices to plausible words but allows variety, improving naturalness.
  3. Final Answer:

    Use top-k sampling with a moderate k and top-p sampling with p around 0.9 together. -> Option D
  4. Quick Check:

    Combining top-k and top-p balances randomness and coherence [OK]
Hint: Combine top-k and top-p for best text quality [OK]
Common Mistakes:
  • Choosing greedy decoding for creativity
  • Ignoring probability thresholds
  • Using too small k or p values