| Users | Search Requests per Second | Data Size | System Changes |
|---|---|---|---|
| 100 | 10-50 | Small index (MBs) | Single search server, simple index, no caching needed |
| 10,000 | 1,000-5,000 | Medium index (GBs) | Introduce caching, optimize index, add load balancer |
| 1,000,000 | 100,000+ | Large index (TBs) | Distributed search cluster, sharded indexes, CDN for static results |
| 100,000,000 | 10M+ | Massive index (PBs) | Multi-region clusters, advanced sharding, heavy caching, AI-based ranking |
Search functionality design in LLD - Scalability & System Analysis
Start learning this pattern below
Jump into concepts and practice - no test required
The search index storage and query processing become the first bottleneck as user requests grow. At small scale, a single server can handle indexing and queries. But as traffic and data size increase, the CPU and memory needed to process complex queries and maintain the index exceed one server's capacity.
- Horizontal scaling: Add more search servers and distribute queries using a load balancer.
- Sharding: Split the search index into smaller parts across servers to reduce query load per server.
- Caching: Cache frequent queries and results in memory (e.g., Redis) to reduce repeated processing.
- CDN: Use Content Delivery Networks to serve static search assets and reduce latency globally.
- Index optimization: Use efficient data structures and incremental indexing to speed up updates and queries.
Assuming 1M users generate 100K search requests per second:
- Each search request ~50KB data transferred -> 100K * 50KB = ~5GB/s bandwidth needed.
- Storage for index ~1TB for large dataset.
- Each server handles ~5,000 QPS -> need ~20 search servers.
- Memory per server ~64GB to hold index shards and cache.
Start by clarifying the expected traffic and data size. Then identify the main bottleneck (index storage and query processing). Discuss scaling strategies step-by-step: caching, horizontal scaling, sharding, and CDN. Always justify why each solution fits the bottleneck.
Your database handles 1000 QPS for search queries. Traffic grows 10x to 10,000 QPS. What do you do first?
Answer: Add read replicas or caching to reduce load on the main database before scaling application servers. This addresses the database bottleneck first.
Practice
Solution
Step 1: Understand the role of an index in search
An index maps keywords to data entries, enabling fast lookup instead of scanning all data.Step 2: Identify the correct purpose
Since search needs to find matching data quickly, the index helps achieve this by direct access.Final Answer:
To quickly find data entries matching search keywords -> Option DQuick Check:
Index = Fast keyword lookup [OK]
- Confusing index with data compression
- Thinking index stores passwords
- Assuming index speeds up image display
Solution
Step 1: Recall common data structures for fast lookup
Hash maps provide average O(1) time for key-based access, ideal for mapping keywords to data.Step 2: Eliminate other options
Linked lists, stacks, and queues do not provide efficient direct lookup by key.Final Answer:
Hash map -> Option AQuick Check:
Hash map = Fast key lookup [OK]
- Choosing linked list which is slow for lookup
- Confusing stack or queue with key-value storage
- Ignoring average O(1) lookup time of hash maps
Solution
Step 1: Identify documents for each keyword
'apple' maps to documents [1, 3, 5], 'banana' maps to [2, 3].Step 2: Find intersection of document lists
Documents containing both keywords are in both lists. Intersection of [1, 3, 5] and [2, 3] is [3].Final Answer:
[3] -> Option BQuick Check:
Intersection = [3] [OK]
- Merging lists instead of intersecting
- Confusing union with intersection
- Ignoring common documents
index = {}
keywords = ['apple', 'banana', 'apple']
docs = [1, 2, 3]
for i in range(len(keywords)):
index[keywords[i]] = docs[i]
print(index)
What is the bug in this code?Solution
Step 1: Analyze how index is updated
The loop assigns index[keyword] = doc, so duplicate keywords overwrite previous values.Step 2: Identify the bug
For 'apple', first doc 1 is stored, then overwritten by doc 3, losing doc 1.Final Answer:
It overwrites previous document IDs for duplicate keywords -> Option AQuick Check:
Duplicate keys overwrite values in hash map [OK]
- Thinking loop range is wrong
- Assuming index is not initialized
- Confusing data structure type
Solution
Step 1: Consider scalability and speed needs
Millions of products and high traffic require fast, scalable search with distributed storage and caching.Step 2: Evaluate options
Use an inverted index stored in a distributed NoSQL database with caching layers uses inverted index (fast keyword lookup), distributed NoSQL (scalable), and caching (speed). Others are slow or incomplete.Final Answer:
Use an inverted index stored in a distributed NoSQL database with caching layers -> Option CQuick Check:
Inverted index + distributed storage + cache = scalable fast search [OK]
- Choosing full table scan for large data
- Filtering on client side for millions of items
- Indexing only a small subset of data
