Caching strategies for LLMs in Prompt Engineering / GenAI - Practice Problems & Coding Challenges

Challenge - 5 Problems
🧠 Conceptual
intermediate
Understanding Cache Hit in LLMs
In the context of caching strategies for large language models (LLMs), what does a cache hit mean?
A. The requested output is found in the cache, so the model can reuse it without recomputing.
B. The model fails to find any relevant data in the cache and must query an external database.
C. The model generates a new response from scratch without using cached data.
D. The cache is full and cannot store any more outputs.
💡 Hint
Think about what happens when the model can reuse previous results quickly.
Model Choice
intermediate
Choosing a Cache Type for LLMs
Which caching strategy is best suited for storing frequently requested LLM outputs to reduce latency?
A. Least Recently Used (LRU) cache, which evicts the least recently accessed outputs when full.
B. Write-back cache, where outputs are stored only in cache and written back later to main storage.
C. Write-through cache, where every output is immediately written to the main storage and cache.
D. Random replacement cache, which evicts random outputs when the cache is full.
💡 Hint
Think about which method keeps the most useful outputs available.
Metrics
advanced
Evaluating Cache Effectiveness
Given an LLM caching system, which metric best measures how often the cache successfully provides outputs without recomputation?
A. Throughput, the number of outputs generated per second.
B. Latency, the time taken to generate outputs from scratch.
C. Cache miss rate, the percentage of requests not found in cache.
D. Cache hit rate, the percentage of requests found in cache.
💡 Hint
This metric shows how often the cache helps avoid extra work.
🔧 Debug
advanced
Identifying Cache Invalidation Issue
An LLM caching system returns outdated responses after the underlying model is updated. What is the most likely cause?
A. The model is overfitting to cached outputs.
B. Cache hit rate is too low, causing frequent recomputations.
C. Cache invalidation is not properly implemented, so old outputs remain in cache.
D. The cache size is too large, causing slow lookups.
💡 Hint
Think about what happens when cached data does not reflect model changes.
Predict Output
expert
Output of LRU Cache Simulation Code
What is the output of this Python code simulating an LRU cache for LLM outputs?
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, key):
        if key not in self.cache:
            return -1
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

cache = LRUCache(2)
cache.put('a', 'output1')
cache.put('b', 'output2')
print(cache.get('a'))
cache.put('c', 'output3')
print(cache.get('b'))
print(cache.get('c'))
A. -1\noutput2\noutput3
B. output1\n-1\noutput3
C. output1\noutput2\noutput3
D. -1\n-1\noutput3
💡 Hint
Remember that the cache capacity is 2 and least recently used items get removed.

Practice

1. What is the main purpose of caching in large language models (LLMs)?
easy
A. To save previous answers and avoid repeating work
B. To increase the size of the model
C. To change the model's training data
D. To make the model forget old information

Solution

  1. Step 1: Understand caching concept

    Caching stores previous results so the system can reuse them instead of recalculating.
  2. Step 2: Apply to LLMs context

    In LLMs, caching saves time and resources by reusing answers for repeated inputs.
  3. Final Answer:

    To save previous answers and avoid repeating work -> Option A
  4. Quick Check:

    Caching = Save and reuse answers [OK]
Hint: Caching means saving past answers to reuse them [OK]
Common Mistakes:
  • Thinking caching changes model size
  • Confusing caching with training data updates
  • Believing caching deletes old info
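The reuse described in this solution can be made measurable by counting hits and misses. The sketch below is illustrative only: CountingCache, get_or_compute, and fake_llm are hypothetical names, and fake_llm stands in for a real model call.

```python
class CountingCache:
    """Dictionary cache that tracks hits and misses (illustrative only)."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, prompt, compute):
        # Cache hit: reuse the stored answer without calling compute.
        if prompt in self.store:
            self.hits += 1
            return self.store[prompt]
        # Cache miss: compute once, then remember the result.
        self.misses += 1
        result = compute(prompt)
        self.store[prompt] = result
        return result

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def fake_llm(prompt):
    # Hypothetical stand-in for a real model call.
    return f"Answer for {prompt}"


counting_cache = CountingCache()
for prompt in ["hello", "hello", "bye", "hello"]:
    counting_cache.get_or_compute(prompt, fake_llm)

print(counting_cache.hits, counting_cache.misses, counting_cache.hit_rate())
# prints: 2 2 0.5
```

Two of the four requests repeat an earlier prompt, so half of them are served without recomputation.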
2. Which Python tool is commonly used for simple caching in LLM applications?
easy
A. os.listdir
B. functools.lru_cache
C. math.sqrt
D. random.shuffle

Solution

  1. Step 1: Identify caching tools in Python

    functools.lru_cache is a built-in decorator for caching function results.
  2. Step 2: Check other options

    random.shuffle shuffles lists, math.sqrt calculates square roots, os.listdir lists files; none cache results.
  3. Final Answer:

    functools.lru_cache -> Option B
  4. Quick Check:

    Python caching tool = lru_cache [OK]
Hint: lru_cache is Python's simple caching decorator [OK]
Common Mistakes:
  • Choosing random.shuffle as caching
  • Confusing math functions with caching
  • Picking file system functions
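As a rough illustration of the decorator named in this solution, the sketch below wraps a placeholder function with functools.lru_cache. The function cached_answer and the counter call_count are hypothetical; a real application would call a model where the f-string is built.

```python
from functools import lru_cache

call_count = 0  # counts how often the function body actually runs


@lru_cache(maxsize=128)
def cached_answer(prompt: str) -> str:
    # Hypothetical stand-in for an expensive LLM call.
    global call_count
    call_count += 1
    return f"Answer for {prompt}"


cached_answer("hello")
cached_answer("hello")  # served from the cache; the body does not run again

info = cached_answer.cache_info()
print(call_count, info.hits, info.misses)
# prints: 1 1 1
```

Note that lru_cache requires hashable arguments and keeps results only in the current process, so it suits simple, single-process cases.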
3. Given this Python code using a dictionary cache for LLM responses, what will be printed?
cache = {}
def get_response(input_text):
    if input_text in cache:
        return cache[input_text]
    response = f"Answer for {input_text}"
    cache[input_text] = response
    return response

print(get_response('hello'))
print(get_response('hello'))
medium
A. None\nAnswer for hello
B. Answer for hello\nNone
C. Error: KeyError
D. Answer for hello\nAnswer for hello

Solution

  1. Step 1: Analyze first call get_response('hello')

    Cache is empty, so it creates 'Answer for hello', stores it, and returns it.
  2. Step 2: Analyze second call get_response('hello')

    Input is in cache, so it returns cached 'Answer for hello' without recomputing.
  3. Final Answer:

    Answer for hello\nAnswer for hello -> Option D
  4. Quick Check:

    Cache hit returns saved answer [OK]
Hint: Cache returns saved answer on repeated input [OK]
Common Mistakes:
  • Assuming second call returns None
  • Expecting error on repeated key
  • Thinking cache clears automatically
4. This code tries to cache LLM outputs but has a bug. What is the error?
cache = {}
def get_response(input_text):
    if input_text in cache:
        return cache[input_text]
    response = f"Answer for {input_text}"
    cache = {input_text: response}
    return response

print(get_response('test'))
print(get_response('test'))
medium
A. Cache is reset each call, losing previous entries
B. generate_answer function is undefined
C. Syntax error in dictionary assignment
D. Infinite recursion in get_response

Solution

  1. Step 1: Check cache update line

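    cache = {input_text: response} binds a brand-new dict to the name cache instead of adding a key to the existing dict.
  2. Step 2: Understand effect on repeated calls

    As written, that assignment also makes cache a local variable inside get_response, so Python raises UnboundLocalError at the "if input_text in cache" check on the first call. Even with a "global cache" declaration, every miss would replace the whole dict, so only the most recent entry would survive. Option A is the closest description of the bug.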
  3. Final Answer:

    Cache is reset each call, losing previous entries -> Option A
  4. Quick Check:

    Cache replaced, not updated [OK]
Hint: Use cache[key] = value to update, not assign new dict [OK]
Common Mistakes:
  • Thinking generate_answer is missing
  • Assuming syntax error in dict
  • Believing recursion happens
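A minimal corrected sketch, following the hint to update the existing dictionary rather than assign a new one. It reuses the names from the problem; the second input 'other' is added here only to show that earlier entries are kept.

```python
cache = {}


def get_response(input_text):
    if input_text in cache:
        return cache[input_text]
    response = f"Answer for {input_text}"
    # Item assignment mutates the existing dict; it does not rebind the name,
    # so no `global` declaration is needed and earlier entries are kept.
    cache[input_text] = response
    return response


print(get_response('test'))
print(get_response('other'))
print(len(cache))
# prints:
# Answer for test
# Answer for other
# 2
```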
5. You want to cache partial results of LLM calls to speed up responses when inputs share common prefixes. Which caching strategy best fits this need?
hard
A. Use random sampling to cache some inputs
B. Cache only full input strings as dictionary keys
C. Use a trie (prefix tree) to store cached outputs by input prefixes
D. Clear cache after every call to save memory

Solution

  1. Step 1: Understand prefix sharing in inputs

    Inputs sharing prefixes can reuse partial results if cached by prefix.
  2. Step 2: Identify suitable data structure

    A trie (prefix tree) efficiently stores and retrieves data by prefixes, ideal for this case.
  3. Final Answer:

    Use a trie (prefix tree) to store cached outputs by input prefixes -> Option C
  4. Quick Check:

    Prefix caching = trie structure [OK]
Hint: Trie caches shared prefixes efficiently [OK]
Common Mistakes:
  • Caching only full inputs misses prefix reuse
  • Random caching is inefficient
  • Clearing cache wastes saved data
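The trie idea can be sketched as follows. This is an assumption-laden illustration, not a production design: TrieNode, PrefixCache, and longest_prefix are hypothetical names, inputs are pre-split into word tokens, and the string "state-1" is a placeholder for whatever partial result the application stores.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.value = None       # cached result for the prefix ending here
        self.has_value = False  # distinguishes "no entry" from a None value


class PrefixCache:
    """Token-level trie; lookup returns the longest cached prefix."""

    def __init__(self):
        self.root = TrieNode()

    def put(self, tokens, value):
        node = self.root
        for token in tokens:
            node = node.children.setdefault(token, TrieNode())
        node.value = value
        node.has_value = True

    def longest_prefix(self, tokens):
        """Return (matched_length, value) for the longest cached prefix."""
        node = self.root
        best = (0, None)
        for i, token in enumerate(tokens, start=1):
            node = node.children.get(token)
            if node is None:
                break
            if node.has_value:
                best = (i, node.value)
        return best


prefix_cache = PrefixCache()
prefix_cache.put(["You", "are", "a", "helpful", "assistant"], "state-1")
matched, value = prefix_cache.longest_prefix(
    ["You", "are", "a", "helpful", "assistant", "Summarize", "this"]
)
print(matched, value)
# prints: 5 state-1
```

A new input that starts with the cached five tokens finds the stored result for that prefix, so only the remaining tokens need fresh work.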