Caching strategies for LLMs in Prompt Engineering / GenAI - Full Explanation

Introduction
Generating a response with a large language model (LLM) takes time and computing power. Caching helps by saving answers to repeated questions, so the model doesn't have to work from scratch every time.
Explanation
Response Caching
This strategy saves the exact answers the LLM gives for specific inputs. When the same input appears again, the saved answer is reused instantly without asking the model again.
Response caching speeds up repeated queries by reusing previous answers.
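As a minimal sketch of response caching, the snippet below keys a dictionary on a hash of the prompt. The `call_llm` function, the `demo-model` name, and the whitespace and case normalization are illustrative assumptions, not part of any particular library.

```python
import hashlib

# Hypothetical stand-in for a real model call; it only counts invocations.
call_count = 0

def call_llm(prompt):
    global call_count
    call_count += 1
    return f"Answer for: {prompt}"

response_cache = {}

def cache_key(prompt, model="demo-model"):
    # Include the model name so answers from different models never collide.
    normalized = " ".join(prompt.split()).lower()
    return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

def cached_completion(prompt):
    key = cache_key(prompt)
    if key not in response_cache:
        response_cache[key] = call_llm(prompt)  # cache miss: ask the model once
    return response_cache[key]

first = cached_completion("What is caching?")
second = cached_completion("what is   caching?")  # same key after normalization
```

Here `second` is served from the cache, so the model stand-in runs only once. Normalizing before hashing trades exactness for a higher hit rate; it can be dropped if prompts must match character for character.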
Partial Output Caching
Instead of saving the full answer, this method stores parts of the output or intermediate results. It helps when answers share common sections, allowing reuse of those parts to build new responses faster.
Partial output caching saves time by reusing shared parts of answers.
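One way to picture partial output caching is an answer assembled from reusable sections. In this sketch, `generate_section` is a hypothetical stand-in for an LLM call that writes one section, and the topic names are made up for illustration.

```python
section_cache = {}
generated = []  # records which sections were actually generated

def generate_section(topic):
    # Hypothetical stand-in for an LLM call that writes one section.
    generated.append(topic)
    return f"[{topic} section]"

def get_section(topic):
    if topic not in section_cache:
        section_cache[topic] = generate_section(topic)
    return section_cache[topic]

def build_answer(topics):
    # Assemble a full response from cached or freshly generated parts.
    return "\n".join(get_section(topic) for topic in topics)

answer_a = build_answer(["pricing", "refund policy"])
answer_b = build_answer(["pricing", "shipping"])  # reuses the pricing part
```

The second answer only triggers generation for the `shipping` section, because the `pricing` section was already cached while building the first answer.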
Context Window Caching
An LLM can only consider a limited amount of text at once, called its context window. This strategy caches the already processed form of context that repeats across requests, such as a fixed system prompt or earlier conversation turns, so only the new input has to be processed.
Caching context chunks reduces repeated processing of input history.
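The bookkeeping behind this idea can be sketched as a cache keyed by the conversation prefix. This is a toy under stated assumptions: `process_chunk` is a hypothetical stand-in for expensive model work, and the "state" is just a tuple of words. In real deployments this kind of reuse usually happens inside the model server, where the cached state is internal model data rather than text.

```python
import hashlib

prefix_cache = {}  # hash of the prefix so far -> processed state for that prefix
processed = []     # records chunks that were actually processed

def process_chunk(state, chunk):
    # Hypothetical stand-in for expensive model work on one chunk.
    processed.append(chunk)
    return state + tuple(chunk.split())

def process_context(chunks):
    state, hasher = (), hashlib.sha256()
    for chunk in chunks:
        # The key covers every chunk up to this point, because the state
        # for a chunk depends on everything that came before it.
        hasher.update(chunk.encode("utf-8") + b"\x00")
        key = hasher.hexdigest()
        if key not in prefix_cache:
            prefix_cache[key] = process_chunk(state, chunk)
        state = prefix_cache[key]
    return state

history = ["You are a helpful assistant.", "User: What is caching?"]
process_context(history)
state = process_context(history + ["User: Give an example."])
```

On the second call, the two earlier chunks are found in the cache and only the new turn is processed.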
Embedding Caching
Embeddings are numeric vectors that summarize the meaning of text, which lets a system compare texts by similarity. Caching embeddings for common inputs avoids recalculating them, speeding up similarity searches or related tasks.
Embedding caching saves time by reusing computed text summaries.
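A small sketch of embedding caching follows. The `compute_embedding` function is a toy assumption that derives four numbers from a hash; a real system would call an embedding model at that point.

```python
import hashlib

embedding_cache = {}
computed = []  # records texts whose embedding was actually computed

def compute_embedding(text):
    # Toy embedding for illustration only, not a meaningful vector.
    computed.append(text)
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:4]]

def embed_many(texts):
    # Only compute embeddings that are not cached yet.
    for text in texts:
        if text not in embedding_cache:
            embedding_cache[text] = compute_embedding(text)
    return [embedding_cache[text] for text in texts]

embed_many(["refund policy", "shipping times"])
vectors = embed_many(["refund policy", "return window"])
```

The second batch computes a vector only for `return window`; the vector for `refund policy` comes from the cache.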
Real World Analogy

Imagine a busy coffee shop where customers often order the same drinks. Instead of making each drink from scratch every time, the barista keeps some popular drinks ready or remembers how to quickly prepare them. This saves time and keeps customers happy.

Response Caching → Serving a pre-made popular coffee instantly when a customer orders it again
Partial Output Caching → Using pre-prepared coffee shots or milk foam that can be combined to make different drinks faster
Context Window Caching → Remembering recent orders to quickly prepare similar drinks without starting from zero
Embedding Caching → Having a recipe book with summaries of popular drinks to quickly find how to make them
Diagram
┌─────────────────────────────┐
│         User Input          │
└──────────────┬──────────────┘
               │
      ┌────────▼────────┐         ┌───────────────────┐
      │ Check Response  │         │ Return Cached     │
      │ Cache           ├── Yes ─►│ Response          │
      └────────┬────────┘         └───────────────────┘
               │ No
               ▼
     ┌───────────────────┐
     │ Process Input     │
     │ (Context, Embeds) │
     └─────────┬─────────┘
               │
     ┌─────────▼─────────┐
     │ Generate Response │
     └─────────┬─────────┘
               │
     ┌─────────▼─────────┐
     │ Cache Results     │
     └─────────┬─────────┘
               │
               ▼
     ┌───────────────────┐
     │ Return Response   │
     └───────────────────┘
This diagram shows how caching checks for saved answers before processing input and generating new responses.
Key Facts
Response Caching: Stores full answers to reuse for identical inputs.
Partial Output Caching: Saves parts of answers to build new responses faster.
Context Window: The limited amount of recent input the LLM can consider at once.
Embedding: A numeric summary representing the meaning of text.
Caching: Saving data temporarily to speed up future access.
Common Confusions
Misconception: Caching means storing every possible answer in advance.
Correction: Caching only saves answers or parts for inputs that have already been processed, not all possible questions.
Misconception: Cached responses are always perfect and up-to-date.
Correction: Cached answers may become outdated if the model or data changes, so caches need refreshing.
Misconception: Embedding caching stores the full text input.
Correction: Embedding caching stores numeric summaries, not the original text.
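The point about outdated answers is commonly handled by giving each entry an expiry time. The sketch below is one possible approach, not a prescribed one; the `TTLCache` class and the injectable `clock` parameter are assumptions made so the behavior can be shown without waiting.

```python
import time

class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.store = {}  # key -> (expiry time, value)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self.store[key]  # stale entry: caller must regenerate
            return None
        return value

    def set(self, key, value):
        self.store[key] = (self.clock() + self.ttl, value)

now = [0.0]  # fake clock so the example runs instantly
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.set("capital of France?", "Paris")
fresh = cache.get("capital of France?")
now[0] = 61.0
stale = cache.get("capital of France?")
```

Before the 60 seconds pass the lookup returns the saved answer; afterwards it returns `None`, which signals that the answer should be generated again.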
Summary
Caching saves time by reusing previous answers or parts of answers instead of generating them again.
Different caching strategies focus on full responses, partial outputs, context chunks, or embeddings.
Effective caching improves speed and reduces computing costs when using large language models.

Practice

1. What is the main purpose of caching in large language models (LLMs)?
easy
A. To save previous answers and avoid repeating work
B. To increase the size of the model
C. To change the model's training data
D. To make the model forget old information

Solution

  1. Step 1: Understand caching concept

    Caching stores previous results so the system can reuse them instead of recalculating.
  2. Step 2: Apply to LLMs context

    In LLMs, caching saves time and resources by reusing answers for repeated inputs.
  3. Final Answer:

    To save previous answers and avoid repeating work -> Option A
  4. Quick Check:

    Caching = Save and reuse answers [OK]
Hint: Caching means saving past answers to reuse them [OK]
Common Mistakes:
  • Thinking caching changes model size
  • Confusing caching with training data updates
  • Believing caching deletes old info
2. Which Python tool is commonly used for simple caching in LLM applications?
easy
A. os.listdir
B. functools.lru_cache
C. math.sqrt
D. random.shuffle

Solution

  1. Step 1: Identify caching tools in Python

    functools.lru_cache is a built-in decorator for caching function results.
  2. Step 2: Check other options

    random.shuffle shuffles lists, math.sqrt calculates square roots, os.listdir lists files; none cache results.
  3. Final Answer:

    functools.lru_cache -> Option B
  4. Quick Check:

    Python caching tool = lru_cache [OK]
Hint: lru_cache is Python's simple caching decorator [OK]
Common Mistakes:
  • Choosing random.shuffle as caching
  • Confusing math functions with caching
  • Picking file system functions
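For reference, a short sketch of `functools.lru_cache` wrapping a model call. The `ask_model` function is a hypothetical stand-in for a real LLM request.

```python
from functools import lru_cache

calls = []  # records prompts that actually reached the function body

@lru_cache(maxsize=128)
def ask_model(prompt):
    # Hypothetical stand-in for a real LLM call.
    calls.append(prompt)
    return f"Answer for {prompt}"

ask_model("hello")
ask_model("hello")  # served from the cache
ask_model("world")
stats = ask_model.cache_info()
```

`cache_info()` reports hits and misses, which helps confirm the cache is working. Arguments must be hashable (strings are), and once `maxsize` entries are stored the least recently used one is evicted.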
3. Given this Python code using a dictionary cache for LLM responses, what will be printed?
cache = {}
def get_response(input_text):
    if input_text in cache:
        return cache[input_text]
    response = f"Answer for {input_text}"
    cache[input_text] = response
    return response

print(get_response('hello'))
print(get_response('hello'))
medium
A. None\nAnswer for hello
B. Answer for hello\nNone
C. Error: KeyError
D. Answer for hello\nAnswer for hello

Solution

  1. Step 1: Analyze first call get_response('hello')

    Cache is empty, so it creates 'Answer for hello', stores it, and returns it.
  2. Step 2: Analyze second call get_response('hello')

    Input is in cache, so it returns cached 'Answer for hello' without recomputing.
  3. Final Answer:

    Answer for hello\nAnswer for hello -> Option D
  4. Quick Check:

    Cache hit returns saved answer [OK]
Hint: Cache returns saved answer on repeated input [OK]
Common Mistakes:
  • Assuming second call returns None
  • Expecting error on repeated key
  • Thinking cache clears automatically
4. This code tries to cache LLM outputs but has a bug. What is the error?
cache = {}
def get_response(input_text):
    if input_text in cache:
        return cache[input_text]
    response = f"Answer for {input_text}"
    cache = {input_text: response}
    return response

print(get_response('test'))
print(get_response('test'))
medium
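A. cache is reassigned to a new dict inside the function instead of being updated in place
B. generate_answer function is undefined
C. Syntax error in dictionary assignment
D. Infinite recursion in get_response

Solution

  1. Step 1: Check cache update line

    cache = {input_text: response} assigns a brand-new dict to the name cache instead of adding a key to the existing one, so earlier entries would be discarded.
  2. Step 2: Understand effect on repeated calls

    Because the function assigns to cache without declaring it global, Python treats cache as a local variable, so the check input_text in cache raises UnboundLocalError on the very first call. Adding global cache alone would not fix it: every cache miss would still replace the dict and drop earlier entries.
  3. Final Answer:

    cache is reassigned to a new dict inside the function instead of being updated in place -> Option A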
  4. Quick Check:

    Cache replaced, not updated [OK]
Hint: Use cache[key] = value to update, not assign new dict [OK]
Common Mistakes:
  • Thinking generate_answer is missing
  • Assuming syntax error in dict
  • Believing recursion happens
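Following the hint, a corrected sketch updates the dictionary in place. Item assignment mutates the existing module-level dict, so the name `cache` is never rebound inside the function, no `global` statement is needed, and earlier entries survive.

```python
cache = {}

def get_response(input_text):
    if input_text in cache:
        return cache[input_text]
    response = f"Answer for {input_text}"
    cache[input_text] = response  # update in place; `cache` is not rebound
    return response

first = get_response("test")
second = get_response("other")
third = get_response("test")  # cache hit; the "other" entry is still present
```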
5. You want to cache partial results of LLM calls to speed up responses when inputs share common prefixes. Which caching strategy best fits this need?
hard
A. Use random sampling to cache some inputs
B. Cache only full input strings as dictionary keys
C. Use a trie (prefix tree) to store cached outputs by input prefixes
D. Clear cache after every call to save memory

Solution

  1. Step 1: Understand prefix sharing in inputs

    Inputs sharing prefixes can reuse partial results if cached by prefix.
  2. Step 2: Identify suitable data structure

    A trie (prefix tree) efficiently stores and retrieves data by prefixes, ideal for this case.
  3. Final Answer:

    Use a trie (prefix tree) to store cached outputs by input prefixes -> Option C
  4. Quick Check:

    Prefix caching = trie structure [OK]
Hint: Trie caches shared prefixes efficiently [OK]
Common Mistakes:
  • Caching only full inputs misses prefix reuse
  • Random caching is inefficient
  • Clearing cache wastes saved data
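The trie idea from this question can be sketched as follows. The `PrefixCache` class, the whitespace tokenization, and the `"state-A"` placeholder value are illustrative assumptions; a real system would store whatever partial result it wants to reuse.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None  # cached partial result for the prefix ending here

class PrefixCache:
    def __init__(self):
        self.root = TrieNode()

    def put(self, tokens, value):
        node = self.root
        for token in tokens:
            node = node.children.setdefault(token, TrieNode())
        node.value = value

    def longest_prefix(self, tokens):
        # Return (matched length, cached value) for the longest cached prefix.
        node, best = self.root, (0, None)
        for depth, token in enumerate(tokens, start=1):
            node = node.children.get(token)
            if node is None:
                break
            if node.value is not None:
                best = (depth, node.value)
        return best

prefix_cache = PrefixCache()
prefix_cache.put("You are a helpful assistant .".split(), "state-A")
matched, value = prefix_cache.longest_prefix(
    "You are a helpful assistant . Summarize this text".split()
)
```

The lookup walks the new input token by token and reports how many leading tokens were already cached, so only the remaining tokens need fresh work. An input with no cached prefix returns a matched length of 0 and no value.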