Introduction
When using large language models (LLMs), generating a response costs both time and compute. Caching helps by saving answers to repeated questions, so the model doesn't have to generate them from scratch every time.
Imagine a busy coffee shop where customers often order the same drinks. Instead of making each drink from scratch every time, the barista keeps some popular drinks ready or remembers how to quickly prepare them. This saves time and keeps customers happy.
┌─────────────────────────────┐
│         User Input          │
└──────────────┬──────────────┘
               │
      ┌────────▼────────┐
      │ Check Response  │
      │      Cache      │
      └────┬───────┬────┘
      Yes  │       │  No
    ┌──────▼─────┐ │
    │   Return   │ │
    │   Cached   │ │
    │  Response  │ │
    └────────────┘ │
         ┌─────────▼─────────┐
         │   Process Input   │
         │ (Context, Embeds) │
         └─────────┬─────────┘
                   │
         ┌─────────▼─────────┐
         │ Generate Response │
         └─────────┬─────────┘
                   │
         ┌─────────▼─────────┐
         │   Cache Results   │
         └─────────┬─────────┘
                   │
         ┌─────────▼─────────┐
         │  Return Response  │
         └───────────────────┘