Design: Web Crawler
This design covers the crawling system: URL management, fetching, parsing, and storage. Indexing and search are out of scope.
Functional Requirements
FR1: Crawl web pages starting from a list of seed URLs
FR2: Extract and store page content and metadata
FR3: Follow links to discover new pages (see the link-extraction sketch after this list)
FR4: Respect robots.txt rules and crawl delays
FR5: Handle millions of URLs efficiently
FR6: Avoid crawling the same URL multiple times
FR7: Support prioritization of URLs to crawl (a frontier sketch follows this list)
FR8: Provide a way to pause and resume crawling
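A minimal sketch of link extraction (FR3) using Python's standard html.parser; the LinkExtractor name and base-URL handling are assumptions, and a production crawler would typically use a more forgiving HTML parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags so new pages can be queued (FR3)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

# Usage: feed fetched HTML, then read the discovered links.
extractor = LinkExtractor("https://example.com/docs/")
extractor.feed('<a href="../about">About</a> <a href="https://example.org/">Ext</a>')
print(extractor.links)  # ['https://example.com/about', 'https://example.org/']
```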
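A minimal in-memory sketch of the URL frontier covering deduplication (FR6) and prioritization (FR7); the class and method names are illustrative. A real deployment would persist the frontier (for example in a database or durable queue) so crawling can be paused and resumed (FR8).

```python
import heapq
import itertools

class URLFrontier:
    """Tracks URLs to crawl: skips duplicates (FR6) and hands them out
    in priority order (FR7). Lower priority values are crawled first."""

    def __init__(self):
        self._seen = set()                  # URLs already enqueued (dedup)
        self._heap = []                     # (priority, tie_breaker, url)
        self._counter = itertools.count()   # FIFO tie-breaker for equal priorities

    def add(self, url, priority=0):
        # Production crawlers normalize URLs (lowercase the host, strip
        # fragments) before this check; the raw string is used here for brevity.
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._counter), url))
        return True

    def pop(self):
        # Return the next URL to crawl, or None when the frontier is empty.
        if not self._heap:
            return None
        _, _, url = heapq.heappop(self._heap)
        return url

# Usage: seed the frontier, then workers pop URLs in priority order.
frontier = URLFrontier()
frontier.add("https://example.com/", priority=0)
frontier.add("https://example.com/about", priority=5)
frontier.add("https://example.com/", priority=1)   # duplicate, ignored
print(frontier.pop())   # https://example.com/
```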
Non-Functional Requirements
NFR1: Scale to crawl at least 10 million pages per day (roughly 115 pages per second sustained)
NFR2: Latency for fetching a page should be under 2 seconds on average
NFR3: System availability should be 99.9%
NFR4: Enforce per-host politeness so the crawler does not overload websites (see the fetcher sketch after this list)
NFR5: Handle network failures and retries gracefully
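The following sketch combines FR4 (robots.txt and crawl delays), NFR4 (per-host politeness), and NFR5 (retries with backoff) using Python's urllib.robotparser and urllib.request. The USER_AGENT value, function names, and in-process dictionaries are assumptions; a distributed crawler would keep per-host state in a shared store rather than in one process.

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

USER_AGENT = "example-crawler"   # assumed user-agent string
_robots = {}                     # host -> parsed robots.txt rules
_last_fetch = {}                 # host -> monotonic time of the last request

def allowed(url):
    """Check robots.txt for the URL's host, caching the parsed rules (FR4)."""
    host = urlsplit(url).netloc
    rp = _robots.get(host)
    if rp is None:
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"https://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass   # robots.txt unreachable: can_fetch stays conservative (False)
        _robots[host] = rp
    return rp.can_fetch(USER_AGENT, url)

def polite_fetch(url, default_delay=1.0, max_retries=3, timeout=2.0):
    """Fetch a URL after waiting out the per-host crawl delay (NFR4),
    retrying transient network errors with exponential backoff (NFR5)."""
    if not allowed(url):
        return None
    host = urlsplit(url).netloc
    delay = _robots[host].crawl_delay(USER_AGENT)
    if delay is None:
        delay = default_delay
    last = _last_fetch.get(host)
    if last is not None:
        wait = delay - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
    for attempt in range(max_retries):
        _last_fetch[host] = time.monotonic()
        try:
            req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.read()
        except OSError:                 # URLError and timeouts subclass OSError
            time.sleep(2 ** attempt)    # back off 1s, 2s, 4s, ...
    return None                         # give up; the caller can requeue the URL
```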
