Experiment - Caching strategies for LLMs

Problem: You have a large language model (LLM) that answers user questions. Each query takes a long time to process, causing delays. The current system processes every query from scratch, even if the same or a similar query was asked before.

Current Metrics: Average response time: 5 seconds per query; Cache hit rate: 0%; User satisfaction score: 70/100

Issue: High latency due to repeated computation for similar queries; no caching implemented.

Your Task

Implement a caching strategy to reduce average response time to under 2 seconds and increase cache hit rate to at least 50%, without reducing answer quality.

You cannot change the LLM architecture or retrain the model.

You must keep the cache size limited to 1000 entries.

The cache must handle both exact and approximate query matches.

Solution

import time
from collections import OrderedDict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

class LRUCache:
    def __init__(self, capacity=1000):
        self.cache = OrderedDict()
        self.capacity = capacity

    def get(self, key):
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)
        return self.cache[key]

    def put(self, key, value):
        if key in self.cache:
            self.cache.move_to_end(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)

class LLMWithCache:
    def __init__(self, llm_function, cache_capacity=1000, similarity_threshold=0.8):
        self.llm = llm_function
        self.cache = LRUCache(cache_capacity)
        self.queries = []
        self.embeddings = np.empty((0, 0))
        self.similarity_threshold = similarity_threshold

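    def embed_query(self, query):
        # Drop queries whose answers the LRU cache has already evicted, so the
        # similarity index never grows beyond the cache capacity
        self.queries = [q for q in self.queries if q in self.cache.cache]
        # Refit TF-IDF on the remaining queries plus the new one. The vocabulary
        # changes with every new query, so all vectors are recomputed each time;
        # this costs O(n) per lookup but keeps the dimensions consistent
        all_queries = self.queries + [query]
        vectorizer = TfidfVectorizer().fit(all_queries)
        embeddings = vectorizer.transform(all_queries).toarray()
        self.embeddings = embeddings[:-1]
        return embeddings[-1].reshape(1, -1)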

    def find_similar_query(self, query_embedding):
        if len(self.queries) == 0:
            return None
        similarities = cosine_similarity(query_embedding, self.embeddings)[0]
        max_idx = np.argmax(similarities)
        if similarities[max_idx] >= self.similarity_threshold:
            return self.queries[max_idx]
        return None

    def query(self, query_text):
        # Check exact cache
        cached_answer = self.cache.get(query_text)
        if cached_answer is not None:
            return cached_answer, True

        # Check approximate cache
        query_emb = self.embed_query(query_text)
        similar_query = self.find_similar_query(query_emb)
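        if similar_query is not None:
            cached_answer = self.cache.get(similar_query)
            # Compare against None so an empty-string answer still counts as a hit
            if cached_answer is not None:
                # Store the answer under the new wording as well, so a repeat
                # of this exact query becomes an exact-match hit
                self.cache.put(query_text, cached_answer)
                self.queries.append(query_text)
                self.embeddings = np.vstack([self.embeddings, query_emb])
                return cached_answer, True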

        # Compute answer from LLM
        answer = self.llm(query_text)
        self.cache.put(query_text, answer)
        self.queries.append(query_text)
        if self.embeddings.size == 0:
            self.embeddings = query_emb
        else:
            self.embeddings = np.vstack([self.embeddings, query_emb])
        return answer, False

# Dummy LLM function simulating delay

def dummy_llm(query):
    time.sleep(5)  # Simulate slow response
    return f"Answer to '{query}'"

# Testing the caching system
llm_cache = LLMWithCache(dummy_llm)

queries = ["What is AI?", "Define artificial intelligence.", "What is AI?", "Explain machine learning.", "Explain machine learning."]

results = []
start_time = time.time()
for q in queries:
    answer, from_cache = llm_cache.query(q)
    results.append((q, answer, from_cache))
end_time = time.time()

average_response_time = (end_time - start_time) / len(queries)
cache_hits = sum(1 for _, _, hit in results if hit)
cache_hit_rate = cache_hits / len(queries) * 100

print(f"Average response time: {average_response_time:.2f} seconds")
print(f"Cache hit rate: {cache_hit_rate:.1f}%")
for q, a, hit in results:
    print(f"Query: {q} | From cache: {hit} | Answer: {a}")

Implemented an LRU cache to store previous query results with a capacity of 1000 entries.

Added approximate matching using TF-IDF vectorization and cosine similarity to find similar queries.

Set a similarity threshold of 0.8 to decide cache hits for approximate matches.

Cached both exact and approximate query results to reduce repeated LLM calls.

Simulated LLM delay to demonstrate caching impact on response time.
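To get a feel for how strict a 0.8 threshold is under TF-IDF, the sketch below scores two pairs of queries. The helper name tfidf_similarity is hypothetical, and it fits the vectorizer on just the two strings being compared, whereas the solution above fits on every cached query, so the exact scores will differ from what the cache sees.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_similarity(a, b):
    """Cosine similarity of two strings under TF-IDF fit on just those two."""
    vectors = TfidfVectorizer().fit_transform([a, b])
    # Slice rows so each argument stays a 2-D (1 x vocabulary) sparse matrix
    return float(cosine_similarity(vectors[0:1], vectors[1:2])[0, 0])

# A paraphrase with no shared words: TF-IDF sees nothing in common
paraphrase = tfidf_similarity("What is AI?", "Define artificial intelligence.")

# A near-duplicate that shares most of its words
near_duplicate = tfidf_similarity("What is AI?", "What is AI exactly?")

print(f"paraphrase:     {paraphrase:.2f}")
print(f"near duplicate: {near_duplicate:.2f}")
```

Because TF-IDF only measures word overlap, paraphrases that use different vocabulary score zero regardless of the threshold. Only queries that reuse most of the same words can become approximate hits.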

Results Interpretation

Before: Average response time: 5 seconds, Cache hit rate: 0%, User satisfaction: 70/100

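After (as reported): Average response time: 2.1 seconds, Cache hit rate: 60%, User satisfaction: 85/100. These figures would correspond to three of the five test queries being served from cache, and 2.1 seconds is still slightly above the 2-second target. With TF-IDF, "Define artificial intelligence." shares no terms with "What is AI?", so its similarity is 0 and it falls below the 0.8 threshold. The test script as written therefore hits only on the two exact repeats, which gives a 40% hit rate and an average of about 3 seconds.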

Caching previous answers and using approximate matching can greatly reduce response time and improve user experience without changing the underlying model.
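The link between hit rate and average response time can be sketched as a weighted average. This assumes every miss costs the full 5-second LLM call and every hit costs a negligible lookup time; the function names and the hit_seconds default are illustrative assumptions, not measurements.

```python
def expected_latency(hit_rate, miss_seconds=5.0, hit_seconds=0.0):
    """Average response time as a weighted mix of cache hits and misses."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_seconds + (1.0 - hit_rate) * miss_seconds

def required_hit_rate(target_seconds, miss_seconds=5.0, hit_seconds=0.0):
    """Hit rate at which the expected latency equals the target."""
    return (miss_seconds - target_seconds) / (miss_seconds - hit_seconds)

for rate in (0.0, 0.4, 0.5, 0.6):
    print(f"hit rate {rate:.0%}: {expected_latency(rate):.2f} s")

print(f"hit rate needed for 2 s: {required_hit_rate(2.0):.0%}")
```

Under these assumptions a 50% hit rate gives about 2.5 seconds on average, so getting below 2 seconds requires a hit rate above 60%.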

Bonus Experiment

Try implementing a semantic caching strategy using pre-trained sentence embeddings (like Sentence-BERT) instead of TF-IDF for better approximate matching.

💡 Hint

Use a library like sentence-transformers to generate embeddings, and use cosine similarity between them for cache lookup.
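One possible starting point for the bonus experiment is sketched below. The class SemanticCache, the helper sbert_embedder, the 0.8 threshold, and the model name "all-MiniLM-L6-v2" are all assumptions chosen for illustration. The sketch takes the embedding function as a parameter so it can be tried with any embedder, and it leaves out LRU eviction, which would still be needed to respect the 1000-entry limit.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class SemanticCache:
    """Illustrative cache that matches queries by embedding similarity."""

    def __init__(self, embed_fn, threshold=0.8):
        self.embed_fn = embed_fn    # maps a string to a vector
        self.threshold = threshold  # assumed value; tune on real queries
        self.entries = []           # list of (embedding, answer) pairs

    def get(self, query):
        # Linear scan for the closest stored query; fine for small caches
        vec = self.embed_fn(query)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query, answer):
        self.entries.append((self.embed_fn(query), answer))

def sbert_embedder(model_name="all-MiniLM-L6-v2"):
    """Build an embed_fn backed by sentence-transformers (imported lazily)."""
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer(model_name)
    return lambda text: model.encode(text, normalize_embeddings=True)

# Example wiring (requires sentence-transformers and downloads the model):
# cache = SemanticCache(sbert_embedder())
# cache.put("What is AI?", "Answer to 'What is AI?'")
# cache.get("Define artificial intelligence.")
```

Whether a given paraphrase clears the threshold depends on the embedding model, so the threshold is best chosen by checking a sample of real query pairs.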