You need to design a search system that indexes documents with rich metadata (tags, authors, dates). Which architecture best supports fast, scalable search with frequent metadata updates?
Think about systems optimized for search and handling frequent updates efficiently.
Distributed search engines such as Elasticsearch are designed for fast, scalable search and can handle frequent metadata updates by indexing metadata separately. Relational databases are less efficient for large-scale full-text search, and sequentially scanning all keys or files is slow and does not scale.
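To make the idea concrete, here is a minimal in-memory sketch of an inverted index over metadata. It is a toy illustration, not Elasticsearch's API or internals: the class name `MetadataIndex` and its methods are hypothetical, and it assumes scalar or list-valued fields with hashable values.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy inverted index: maps (field, value) terms to document ids."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, value) -> set of doc ids
        self.docs = {}                    # doc id -> current metadata

    @staticmethod
    def _terms(metadata):
        # Flatten metadata into (field, value) terms; list values yield one term each.
        for field, value in metadata.items():
            values = value if isinstance(value, (list, tuple, set)) else [value]
            for v in values:
                yield (field, v)

    def upsert(self, doc_id, metadata):
        # Drop the postings of the previous version, then index the new metadata.
        for term in self._terms(self.docs.get(doc_id, {})):
            self.postings[term].discard(doc_id)
        self.docs[doc_id] = metadata
        for term in self._terms(metadata):
            self.postings[term].add(doc_id)

    def search(self, **filters):
        # AND together the posting sets of every requested field=value pair.
        sets = [self.postings.get(term, set()) for term in filters.items()]
        return set.intersection(*sets) if sets else set(self.docs)


index = MetadataIndex()
index.upsert("doc1", {"author": "ana", "tags": ["search", "infra"]})
index.upsert("doc2", {"author": "ben", "tags": ["search"]})
index.upsert("doc2", {"author": "ben", "tags": ["infra"]})  # metadata-only update
print(index.search(author="ben", tags="infra"))
```

A lookup touches only the posting sets for the requested terms rather than scanning every document, and an update rewrites only the postings of the affected document.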
Your metadata search service must index 10 million documents with an average of 20 metadata fields each and serve 1,000 queries per second. What is the best way to estimate the required hardware capacity?
Consider both data size and query complexity, and use real-world benchmarks.
Estimating capacity requires understanding index size and query complexity, then benchmarking similar systems. Counting documents alone, or ignoring metadata size, leads to inaccurate estimates.
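A back-of-the-envelope version of this estimate can be sketched as follows. The document count, field count, and query rate come from the question; every other input (average field size, index overhead factor, replica count, per-node throughput, headroom) is a placeholder assumption that should be replaced with numbers measured in your own benchmark.

```python
import math


def estimate_capacity(num_docs, fields_per_doc, avg_field_bytes,
                      index_overhead, replicas, target_qps,
                      benchmarked_qps_per_node, headroom=0.5):
    """Rough storage and node-count estimate from measured inputs."""
    raw_bytes = num_docs * fields_per_doc * avg_field_bytes
    index_bytes = raw_bytes * index_overhead     # index structures on top of raw data
    total_bytes = index_bytes * (1 + replicas)   # one primary copy plus replicas
    # Plan to run each node at only a fraction of its benchmarked peak.
    usable_qps_per_node = benchmarked_qps_per_node * headroom
    nodes_for_qps = math.ceil(target_qps / usable_qps_per_node)
    return {
        "raw_gb": raw_bytes / 1e9,
        "index_gb": index_bytes / 1e9,
        "total_gb": total_bytes / 1e9,
        "nodes_for_qps": nodes_for_qps,
    }


estimate = estimate_capacity(
    num_docs=10_000_000,
    fields_per_doc=20,
    avg_field_bytes=50,            # assumption: measure on a data sample
    index_overhead=1.5,            # assumption: measure on a test index
    replicas=1,                    # assumption
    target_qps=1000,
    benchmarked_qps_per_node=400,  # assumption: take from a load test
)
print(estimate)
```

The result is only as good as the measured inputs. Query complexity enters through `benchmarked_qps_per_node`, which is why the benchmark should replay a realistic query mix.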
You must choose a data structure for searching documents by metadata. Which option best fits a use case with complex relationships between metadata (e.g., authors collaborating, hierarchical tags)?
Consider which data structure naturally represents relationships and supports complex queries.
Graph databases are designed to model and query complex relationships efficiently. Inverted indexes excel at keyword search but not relationship queries. Relational joins can be expensive at scale. Key-value stores lack query flexibility.
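The kind of query a graph model makes natural can be illustrated with a breadth-first traversal over plain adjacency lists. This is a sketch of the traversal idea only, not a graph database or its query language. The function name `reachable` and the sample data are hypothetical.

```python
from collections import deque


def reachable(graph, start, max_depth=None):
    """Breadth-first traversal: nodes reachable from start, with hop distance."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if max_depth is not None and seen[node] >= max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]
    return seen


# Collaboration edges between authors (undirected, stored in both directions).
coauthors = {"ana": ["ben"], "ben": ["ana", "chen"], "chen": ["ben"]}
# Hierarchical tags: parent -> children.
tag_children = {"databases": ["search", "sql"], "search": ["ranking"]}

print(reachable(coauthors, "ana", max_depth=2))  # collaborators within two hops
print(reachable(tag_children, "databases"))      # every tag under "databases"
```

The same traversal answers both "who collaborates with this author, directly or indirectly" and "which tags fall under this parent tag". In a relational schema, each extra hop would typically be another self-join.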
In a search system with frequent metadata updates, what is the main tradeoff when choosing between real-time indexing and batch indexing?
Think about how update frequency affects system performance and data freshness.
Real-time indexing keeps data fresh but adds load due to constant updates. Batch indexing reduces load by updating less often but causes delays in reflecting changes.
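The tradeoff can be seen in a small sketch that counts index writes under each strategy. Both classes are hypothetical illustrations. The batch variant here flushes on a size threshold, whereas real systems often flush on a timer as well.

```python
class RealTimeIndexer:
    """Applies every update to the index immediately."""

    def __init__(self):
        self.index = {}
        self.write_ops = 0

    def update(self, doc_id, metadata):
        self.index[doc_id] = metadata
        self.write_ops += 1  # one index write per update


class BatchIndexer:
    """Buffers updates and applies them to the index in one write per batch."""

    def __init__(self, batch_size):
        self.index = {}
        self.pending = {}
        self.batch_size = batch_size
        self.write_ops = 0

    def update(self, doc_id, metadata):
        # A later update to the same document replaces the buffered one.
        self.pending[doc_id] = metadata
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.pending:
            self.index.update(self.pending)
            self.pending.clear()
            self.write_ops += 1  # one index write per batch


realtime, batch = RealTimeIndexer(), BatchIndexer(batch_size=3)
updates = [("doc1", {"tag": "draft"}), ("doc1", {"tag": "final"}), ("doc2", {"tag": "draft"})]
for doc_id, metadata in updates:
    realtime.update(doc_id, metadata)
    batch.update(doc_id, metadata)

print(realtime.write_ops, realtime.index)  # fresh, but one write per update
print(batch.write_ops, batch.index)        # no writes yet, and searches see stale data
batch.flush()
print(batch.write_ops, batch.index)        # one write brings the index up to date
```

In this run the real-time indexer performs three writes and is always current, while the batch indexer performs a single write but serves stale results until the flush.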
Trace the request flow when a user searches for documents by metadata in a distributed search system with multiple shards and a metadata cache layer. Which sequence correctly describes the flow?
Consider cache lookup before querying shards and updating cache after aggregation.
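The correct flow is: the user's query reaches the API gateway, the gateway checks the metadata cache, and on a cache miss the query is forwarded to the shards. Each shard returns its partial results, the partial results are aggregated and returned to the user, and the cache is updated with the aggregated result.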
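The flow can be sketched as a single gateway function. Everything here is a simplified assumption: shards are plain callables over in-memory dictionaries, the cache is a dict keyed by the query, and the names `make_shard` and `handle_search` are hypothetical. In this synchronous sketch the cache write happens just before the return; a real service might perform it after responding.

```python
def make_shard(docs, calls):
    """Build a shard: a callable that searches its own slice of the documents."""
    def query_shard(query):
        field, value = query
        calls.append(query)  # record the call so cache hits are observable
        return [doc_id for doc_id, meta in docs.items() if meta.get(field) == value]
    return query_shard


def handle_search(query, cache, shards):
    # 1. Check the cache first.
    if query in cache:
        return cache[query]
    # 2. Cache miss: forward the query to every shard.
    partials = [shard(query) for shard in shards]
    # 3. Aggregate the partial results into one sorted, de-duplicated list.
    results = sorted(set().union(*partials))
    # 4. Store the aggregated result, then return it.
    cache[query] = results
    return results


calls = []
shards = [
    make_shard({"doc1": {"author": "ana"}, "doc2": {"author": "ben"}}, calls),
    make_shard({"doc3": {"author": "ana"}}, calls),
]
cache = {}
print(handle_search(("author", "ana"), cache, shards))  # miss: both shards queried
print(handle_search(("author", "ana"), cache, shards))  # hit: served from the cache
print(len(calls))                                       # shard calls made in total
```

The sketch leaves out cache invalidation. With frequent metadata updates, cached entries would need an expiry or explicit invalidation to avoid serving stale results.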
