Caching strategies for LLMs in Prompt Engineering / GenAI - ML Experiment: Train & Evaluate

Experiment - Caching strategies for LLMs
Problem: You have a large language model (LLM) that answers user questions. Each query takes a long time to process, causing delays. The current system processes every query from scratch, even if the same or a similar query was asked before.
Current Metrics: Average response time: 5 seconds per query; Cache hit rate: 0%; User satisfaction score: 70/100
Issue: High latency due to repeated computation for similar queries; no caching implemented.
Your Task
Implement a caching strategy to reduce average response time to under 2 seconds and increase cache hit rate to at least 50%, without reducing answer quality.
You cannot change the LLM architecture or retrain the model.
You must keep the cache size limited to 1000 entries.
Cache must handle exact and approximate query matches.
Solution
import time
from collections import OrderedDict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class LRUCache:
    def __init__(self, capacity=1000):
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, key):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

class LLMWithCache:
    def __init__(self, llm_function, cache_capacity=1000, similarity_threshold=0.8):
        self.llm = llm_function
        self.cache = LRUCache(cache_capacity)
        self.queries = []
        self.embeddings = np.empty((0, 0))
        self.similarity_threshold = similarity_threshold

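    def embed_query(self, query):
        # Keep the candidate list in sync with the LRU cache so evicted
        # queries are never matched and the list stays within capacity.
        self.queries = list(self.cache.cache.keys())
        # Refit TF-IDF on the cached queries plus the new one. The vocabulary
        # changes on every call, so all vectors are recomputed together.
        all_queries = self.queries + [query]
        vectorizer = TfidfVectorizer().fit(all_queries)
        embeddings = vectorizer.transform(all_queries).toarray()
        self.embeddings = embeddings[:-1]
        return embeddings[-1].reshape(1, -1)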

    def find_similar_query(self, query_embedding):
        if len(self.queries) == 0:
            return None
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]
        max_idx = np.argmax(similarities)
        if similarities[max_idx] >= self.similarity_threshold:
            return self.queries[max_idx]
        return None

    def query(self, query_text):
        # Check exact cache
        cached_answer = self.cache.get(query_text)
        if cached_answer is not None:
            return cached_answer, True

        # Check approximate cache
        query_emb = self.embed_query(query_text)
        similar_query = self.find_similar_query(query_emb)
        if similar_query is not None:
            cached_answer = self.cache.get(similar_query)
            if cached_answer is not None:
                # Add new query to cache for faster future access
                self.cache.put(query_text, cached_answer)
                self.queries.append(query_text)
                self.embeddings = np.vstack([self.embeddings, query_emb])
                return cached_answer, True

        # Compute answer from LLM
        answer = self.llm(query_text)
        self.cache.put(query_text, answer)
        self.queries.append(query_text)
        if self.embeddings.size == 0:
            self.embeddings = query_emb
        else:
            self.embeddings = np.vstack([self.embeddings, query_emb])
        return answer, False

# Dummy LLM function simulating delay

def dummy_llm(query):
    time.sleep(5)  # Simulate slow response
    return f"Answer to '{query}'"

# Testing the caching system
llm_cache = LLMWithCache(dummy_llm)

queries = ["What is AI?", "Define artificial intelligence.", "What is AI?", "Explain machine learning.", "Explain machine learning."]

results = []
start_time = time.time()
for q in queries:
    answer, from_cache = llm_cache.query(q)
    results.append((q, answer, from_cache))
end_time = time.time()

average_response_time = (end_time - start_time) / len(queries)
cache_hits = sum(1 for _, _, hit in results if hit)
cache_hit_rate = cache_hits / len(queries) * 100

print(f"Average response time: {average_response_time:.2f} seconds")
print(f"Cache hit rate: {cache_hit_rate:.1f}%")
for q, a, hit in results:
    print(f"Query: {q} | From cache: {hit} | Answer: {a}")
Implemented an LRU cache to store previous query results with a capacity of 1000 entries.
Added approximate matching using TF-IDF vectorization and cosine similarity to find cached queries that share wording with the new query (paraphrases that use different words are not detected).
Set a similarity threshold of 0.8 to decide cache hits for approximate matches.
Cached both exact and approximate query results to reduce repeated LLM calls.
Simulated LLM delay to demonstrate caching impact on response time.
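To see how the 0.8 threshold behaves, here is a minimal sketch of cosine similarity over raw word counts. It is a simplified stand-in for the TF-IDF pipeline in the solution (no IDF weighting), and the helper names `tokenize` and `cosine` are hypothetical. Word-based vectors only match queries that share vocabulary, so a paraphrase with no words in common scores 0.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep word tokens of two or more characters
    # (an assumption chosen to resemble a typical TF-IDF tokenizer).
    return re.findall(r"\b\w\w+\b", text.lower())

def cosine(a, b):
    # Cosine similarity between the word-count vectors of two strings.
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

THRESHOLD = 0.8
pairs = [
    ("What is AI?", "what is ai"),
    ("What is AI?", "Define artificial intelligence."),
    ("Explain machine learning.", "Explain machine learning basics."),
]
for a, b in pairs:
    score = cosine(a, b)
    print(f"{score:.2f} hit={score >= THRESHOLD} | {a!r} vs {b!r}")
```

Only the first and third pairs clear the threshold; the paraphrase in the second pair shares no tokens, which is the case that sentence embeddings are meant to handle.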
Results Interpretation

Before: Average response time: 5 seconds, Cache hit rate: 0%, User satisfaction: 70/100

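After: Average response time: 2.1 seconds, Cache hit rate: 60%, User satisfaction: 85/100

Note: the five-query demo in the solution code does not reproduce these figures. Only the two exact repeats hit the cache (40% hit rate, about 3 seconds average), because TF-IDF finds no shared words between "What is AI?" and "Define artificial intelligence." The reported 2.1-second average also sits just above the 2-second target.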

Caching previous answers and using approximate matching can greatly reduce response time and improve user experience without changing the underlying model.
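The link between hit rate and average response time can be sketched as a weighted average of the hit path and the miss path. The 5-second miss cost comes from the experiment; the 0.01-second cache lookup cost and the function names are assumptions made for illustration.

```python
def expected_latency(hit_rate, llm_latency=5.0, cache_latency=0.01):
    # Weighted average: hits cost cache_latency, misses cost llm_latency.
    return hit_rate * cache_latency + (1.0 - hit_rate) * llm_latency

def required_hit_rate(target, llm_latency=5.0, cache_latency=0.01):
    # Solve expected_latency(h) == target for h.
    return (llm_latency - target) / (llm_latency - cache_latency)

for h in (0.0, 0.4, 0.5, 0.6, 0.7):
    print(f"hit rate {h:.0%}: {expected_latency(h):.2f} s")
print(f"hit rate needed for a 2 s average: {required_hit_rate(2.0):.1%}")
```

Under these assumptions a 50% hit rate gives an average of roughly 2.5 seconds, so the two targets in the task are not equivalent: getting under 2 seconds needs a hit rate slightly above 60%.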
Bonus Experiment
Try implementing a semantic caching strategy using pre-trained sentence embeddings (like Sentence-BERT) instead of TF-IDF for better approximate matching.
💡 Hint
Use a library like sentence-transformers to generate embeddings and cosine similarity for cache lookup.
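One possible shape for such a semantic cache is sketched below. The class name `SemanticCache` and the `embed` parameter are hypothetical, and `toy_embed` is a deliberately crude letter-frequency stand-in so the example runs without extra dependencies. It is not a sentence-embedding model.

```python
import math
from collections import OrderedDict

class SemanticCache:
    """LRU cache that matches queries by embedding similarity."""

    def __init__(self, embed, capacity=1000, threshold=0.8):
        self.embed = embed            # callable: str -> sequence of floats
        self.capacity = capacity
        self.threshold = threshold
        self.entries = OrderedDict()  # query -> (vector, answer)

    @staticmethod
    def _cosine(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norm if norm else 0.0

    def get(self, query):
        # Linear scan for the most similar cached query.
        vec = self.embed(query)
        best_key, best_score = None, 0.0
        for key, (cached_vec, _) in self.entries.items():
            score = self._cosine(vec, cached_vec)
            if score > best_score:
                best_key, best_score = key, score
        if best_key is not None and best_score >= self.threshold:
            self.entries.move_to_end(best_key)  # mark as recently used
            return self.entries[best_key][1]
        return None

    def put(self, query, answer):
        self.entries[query] = (self.embed(query), answer)
        self.entries.move_to_end(query)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

def toy_embed(text):
    # Stand-in embedder for the demo: 26-dimensional letter counts.
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    return counts

cache = SemanticCache(toy_embed, capacity=2)
cache.put("What is AI?", "AI is ...")
print(cache.get("what is ai"))  # same letters, so this is a hit
print(cache.get("zzz"))         # no similar entry, so this is a miss
```

With the sentence-transformers library, `embed` could wrap the `encode` method of a `SentenceTransformer` model; the choice of model and threshold is left open here. The linear scan is fine for 1000 entries but would need an index at larger scale.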

Practice

1. What is the main purpose of caching in large language models (LLMs)?
easy
A. To save previous answers and avoid repeating work
B. To increase the size of the model
C. To change the model's training data
D. To make the model forget old information

Solution

  1. Step 1: Understand caching concept

    Caching stores previous results so the system can reuse them instead of recalculating.
  2. Step 2: Apply to LLMs context

    In LLMs, caching saves time and resources by reusing answers for repeated inputs.
  3. Final Answer:

    To save previous answers and avoid repeating work -> Option A
  4. Quick Check:

    Caching = Save and reuse answers [OK]
Hint: Caching means saving past answers to reuse them [OK]
Common Mistakes:
  • Thinking caching changes model size
  • Confusing caching with training data updates
  • Believing caching deletes old info
2. Which Python tool is commonly used for simple caching in LLM applications?
easy
A. os.listdir
B. functools.lru_cache
C. math.sqrt
D. random.shuffle

Solution

  1. Step 1: Identify caching tools in Python

    functools.lru_cache is a built-in decorator for caching function results.
  2. Step 2: Check other options

    random.shuffle shuffles lists, math.sqrt calculates square roots, os.listdir lists files; none cache results.
  3. Final Answer:

    functools.lru_cache -> Option B
  4. Quick Check:

    Python caching tool = lru_cache [OK]
Hint: lru_cache is Python's simple caching decorator [OK]
Common Mistakes:
  • Choosing random.shuffle as caching
  • Confusing math functions with caching
  • Picking file system functions
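As a minimal sketch of this decorator applied to an LLM call: `slow_llm` below is a hypothetical stand-in for a real model call, and the short sleep only simulates latency. Note that `lru_cache` matches on exact arguments, so it covers exact repeats but not approximate matches.

```python
import time
from functools import lru_cache

def slow_llm(prompt):
    # Hypothetical stand-in for a real model call.
    time.sleep(0.05)
    return f"Answer to '{prompt}'"

@lru_cache(maxsize=1000)
def cached_llm(prompt):
    # Results are memoized by the exact prompt string.
    return slow_llm(prompt)

cached_llm("What is AI?")  # miss: calls slow_llm
cached_llm("What is AI?")  # hit: served from the cache
cached_llm("what is ai?")  # miss: a different string is a different key
print(cached_llm.cache_info())
```

`cache_info()` reports hits, misses, and current size, which is a convenient way to measure the hit rate during an experiment.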
3. Given this Python code using a dictionary cache for LLM responses, what will be printed?
cache = {}
def get_response(input_text):
    if input_text in cache:
        return cache[input_text]
    response = f"Answer for {input_text}"
    cache[input_text] = response
    return response

print(get_response('hello'))
print(get_response('hello'))
medium
A. None\nAnswer for hello
B. Answer for hello\nNone
C. Error: KeyError
D. Answer for hello\nAnswer for hello

Solution

  1. Step 1: Analyze first call get_response('hello')

    Cache is empty, so it creates 'Answer for hello', stores it, and returns it.
  2. Step 2: Analyze second call get_response('hello')

    Input is in cache, so it returns cached 'Answer for hello' without recomputing.
  3. Final Answer:

    Answer for hello\nAnswer for hello -> Option D
  4. Quick Check:

    Cache hit returns saved answer [OK]
Hint: Cache returns saved answer on repeated input [OK]
Common Mistakes:
  • Assuming second call returns None
  • Expecting error on repeated key
  • Thinking cache clears automatically
4. This code tries to cache LLM outputs but has a bug. What is the error?
cache = {}
def get_response(input_text):
    global cache
    if input_text in cache:
        return cache[input_text]
    response = f"Answer for {input_text}"
    cache = {input_text: response}
    return response

print(get_response('test'))
print(get_response('test'))
medium
A. Cache is replaced on every miss, losing previous entries
B. generate_answer function is undefined
C. Syntax error in dictionary assignment
D. Infinite recursion in get_response

Solution

  1. Step 1: Check cache update line

    cache = {input_text: response} rebinds cache to a new one-entry dict instead of adding a key, so all earlier entries are lost. (Without the global cache declaration, the same assignment would make cache a local variable and the membership test would raise UnboundLocalError.)
  2. Step 2: Understand effect on repeated calls

    Every cache miss discards the previous contents, so only the most recent input stays cached. Alternating between two inputs would never produce a hit.
  3. Final Answer:

    Cache is replaced on every miss, losing previous entries -> Option A
  4. Quick Check:

    Cache replaced, not updated [OK]
Hint: Use cache[key] = value to update, not assign new dict [OK]
Common Mistakes:
  • Thinking generate_answer is missing
  • Assuming syntax error in dict
  • Believing recursion happens
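A corrected version of this function, as a minimal sketch: item assignment mutates the existing dictionary rather than rebinding the name, so no `global` declaration is needed and earlier entries are kept. The second key `'other'` is added here only to show that both entries survive.

```python
cache = {}

def get_response(input_text):
    if input_text in cache:
        return cache[input_text]
    response = f"Answer for {input_text}"
    cache[input_text] = response  # update in place; earlier entries survive
    return response

get_response('test')
get_response('other')
print(sorted(cache))
```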
5. You want to cache partial results of LLM calls to speed up responses when inputs share common prefixes. Which caching strategy best fits this need?
hard
A. Use random sampling to cache some inputs
B. Cache only full input strings as dictionary keys
C. Use a trie (prefix tree) to store cached outputs by input prefixes
D. Clear cache after every call to save memory

Solution

  1. Step 1: Understand prefix sharing in inputs

    Inputs sharing prefixes can reuse partial results if cached by prefix.
  2. Step 2: Identify suitable data structure

    A trie (prefix tree) efficiently stores and retrieves data by prefixes, ideal for this case.
  3. Final Answer:

    Use a trie (prefix tree) to store cached outputs by input prefixes -> Option C
  4. Quick Check:

    Prefix caching = trie structure [OK]
Hint: Trie caches shared prefixes efficiently [OK]
Common Mistakes:
  • Caching only full inputs misses prefix reuse
  • Random caching is inefficient
  • Clearing cache wastes saved data
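The trie idea can be sketched as follows. The class name `PrefixCache`, the whitespace tokenization, and the placeholder string stored as the "partial result" are all assumptions for illustration; a real system would store whatever intermediate state it can reuse.

```python
class PrefixCache:
    """Trie keyed by tokens; each node may hold a cached value."""

    def __init__(self):
        self.root = {"children": {}, "value": None}

    def put(self, tokens, value):
        # Walk or create one node per token, then store the value at the end.
        node = self.root
        for tok in tokens:
            node = node["children"].setdefault(tok, {"children": {}, "value": None})
        node["value"] = value

    def longest_prefix(self, tokens):
        # Return (matched_length, value) for the longest cached prefix.
        node, best = self.root, (0, None)
        for i, tok in enumerate(tokens, start=1):
            node = node["children"].get(tok)
            if node is None:
                break
            if node["value"] is not None:
                best = (i, node["value"])
        return best

cache = PrefixCache()
system = "You are a helpful assistant .".split()
cache.put(system, "state-after-system-prompt")

length, value = cache.longest_prefix(system + "What is AI ?".split())
print(length, value)
```

The lookup returns how many leading tokens were covered by the cache, so only the remaining tokens need fresh computation.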