
Design a web crawler in HLD - System Design Exercise

Design: Web Crawler
The design covers the crawling system, including URL management, fetching, parsing, and storage. Indexing and search functionality are out of scope.
Functional Requirements
FR1: Crawl web pages starting from a list of seed URLs
FR2: Extract and store page content and metadata
FR3: Follow links to discover new pages
FR4: Respect robots.txt rules and crawl delays
FR5: Handle millions of URLs efficiently
FR6: Avoid crawling the same URL multiple times
FR7: Support prioritization of URLs to crawl
FR8: Provide a way to pause and resume crawling
Non-Functional Requirements
NFR1: Scale to crawl at least 10 million pages per day
NFR2: Latency for fetching a page should be under 2 seconds on average
NFR3: System availability should be 99.9%
NFR4: Respect politeness to avoid overloading websites
NFR5: Handle network failures and retries gracefully
Think Before You Design
Questions to Ask
❓ Question 1
❓ Question 2
❓ Question 3
❓ Question 4
❓ Question 5
❓ Question 6
❓ Question 7
Key Components
URL Frontier Manager
Fetcher (HTTP client)
Parser (HTML and link extractor)
Robots.txt Manager
URL Deduplication Store
Data Storage (for pages and metadata)
Scheduler and Prioritizer
Monitoring and Logging
Design Patterns
Producer-Consumer for fetching and parsing
Distributed Queue for URL frontier
Bloom Filters or Hash Sets for deduplication
Rate Limiting for politeness
Retry and Backoff strategies for transient failures (a sketch follows this list)
Sharding for scaling storage
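To make the retry-and-backoff item concrete, here is a minimal Python sketch built on the requests library. The retry count, base delay, and timeout values are illustrative assumptions, not values prescribed by the design.

  import random
  import time

  import requests

  def fetch_with_backoff(url, max_retries=3, base_delay=1.0, timeout=5.0):
      """Fetch a URL, retrying transient failures with exponential backoff plus jitter."""
      for attempt in range(max_retries + 1):
          try:
              response = requests.get(url, timeout=timeout)
              if response.status_code < 500:   # only 5xx is treated as transient here
                  return response
          except requests.RequestException:
              pass                             # network error: fall through to backoff
          if attempt == max_retries:
              break
          # Exponential backoff (1s, 2s, 4s, ...) with jitter to avoid synchronized retries.
          time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
      return None                              # give up; the caller can requeue at lower priority

In practice a fetcher would typically also treat 429 responses and Retry-After headers specially and cap the total time spent per URL.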
Reference Architecture
  +-------------------+       +-------------------+       +-------------------+
  | Seed URLs         |       | URL Frontier      |       | Robots.txt Manager|
  +---------+---------+       +---------+---------+       +---------+---------+
            |                           |                           |
            v                           v                           v
  +-------------------+       +-------------------+       +-------------------+
  | URL Deduplication |<----->| Scheduler &       |<----->| Politeness &      |
  | Store             |       | Prioritizer       |       | Rate Limiter      |
  +---------+---------+       +---------+---------+       +---------+---------+
            |                           |                           |
            v                           v                           v
  +-------------------+       +-------------------+       +-------------------+
  | Fetcher (HTTP)    |<----->| Parser (HTML &    |<----->| Data Storage      |
  +-------------------+       | Link Extractor)   |       | (Pages & Metadata)|
                              +-------------------+       +-------------------+
Components
URL Frontier Manager
Distributed Queue (e.g., Kafka, RabbitMQ)
Stores URLs to be crawled and manages their prioritization
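A minimal in-memory sketch of the frontier idea, with one FIFO queue per domain so that politeness can later be enforced per host; a production frontier would sit on a distributed queue such as Kafka, as noted above. The class and method names are illustrative.

  from collections import deque
  from urllib.parse import urlparse

  class UrlFrontier:
      """Holds pending URLs grouped by domain so each host can be rate-limited independently."""

      def __init__(self):
          self.queues = {}  # domain -> deque of URLs waiting to be crawled

      def add(self, url):
          domain = urlparse(url).netloc
          self.queues.setdefault(domain, deque()).append(url)

      def next_for_domain(self, domain):
          queue = self.queues.get(domain)
          return queue.popleft() if queue else None

      def domains(self):
          """Domains that currently have work queued."""
          return [d for d, q in self.queues.items() if q]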
Fetcher
HTTP Client Library (e.g., libcurl, Requests)
Fetches web pages from the internet respecting politeness
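Because roughly 10 million pages per day works out to over 100 fetches per second, the fetcher is usually run with many concurrent workers. A small sketch using a thread pool with requests; the worker count, timeout, and User-Agent string are assumptions.

  from concurrent.futures import ThreadPoolExecutor

  import requests

  HEADERS = {"User-Agent": "example-crawler/0.1 (+https://example.com/bot)"}  # hypothetical identity

  def fetch(url):
      """Download one page; return (url, status, body), or (url, None, None) on failure."""
      try:
          resp = requests.get(url, headers=HEADERS, timeout=5.0)
          return url, resp.status_code, resp.text
      except requests.RequestException:
          return url, None, None

  def fetch_batch(urls, workers=16):
      """Fetch a batch of URLs concurrently; network I/O dominates, so threads scale well here."""
      with ThreadPoolExecutor(max_workers=workers) as pool:
          return list(pool.map(fetch, urls))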
Parser
HTML Parser (e.g., BeautifulSoup, jsoup)
Extracts page content and discovers new URLs from fetched pages
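A sketch of link extraction with BeautifulSoup, as suggested above: it resolves relative links against the page URL, strips fragments, and keeps only HTTP(S) URLs so the deduplication store sees one canonical form per link.

  from urllib.parse import urljoin, urldefrag

  from bs4 import BeautifulSoup

  def extract_links(base_url, html):
      """Return absolute, fragment-free HTTP(S) links found in <a href> attributes."""
      soup = BeautifulSoup(html, "html.parser")
      links = set()
      for anchor in soup.find_all("a", href=True):
          absolute = urljoin(base_url, anchor["href"])   # resolve relative links
          absolute, _fragment = urldefrag(absolute)      # drop #fragments before dedup
          if absolute.startswith(("http://", "https://")):
              links.add(absolute)
      return links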
Robots.txt Manager
Custom or existing robots.txt parser
Checks and enforces crawling rules per website
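Python's standard library already ships a robots.txt parser, so a sketch of this component can stay small; the crawler's user-agent string and the per-domain cache are assumptions.

  from urllib import robotparser
  from urllib.parse import urlparse

  USER_AGENT = "example-crawler"  # hypothetical crawler name
  _parsers = {}                   # domain -> RobotFileParser, cached so robots.txt is fetched once

  def allowed(url):
      """Return True if robots.txt for the URL's domain permits fetching it."""
      parts = urlparse(url)
      domain = parts.netloc
      if domain not in _parsers:
          rp = robotparser.RobotFileParser()
          rp.set_url(f"{parts.scheme}://{domain}/robots.txt")
          try:
              rp.read()           # fetches and parses robots.txt for this domain
          except OSError:
              pass                # unreachable robots.txt: the parser defaults to allowing
          _parsers[domain] = rp
      return _parsers[domain].can_fetch(USER_AGENT, url)

RobotFileParser also exposes crawl_delay(), which can feed the per-domain delay used by the scheduler.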
URL Deduplication Store
Bloom Filter or Distributed Hash Set (e.g., Redis, Cassandra)
Prevents crawling the same URL multiple times
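A toy Bloom filter to illustrate the deduplication trade-off: it answers "definitely not seen" or "probably seen" in constant memory, at the cost of occasional false positives (a URL wrongly skipped). The filter size and hash count are arbitrary here. In a distributed setup the same check is often a Redis SADD, which reports whether the member was new.

  import hashlib

  class BloomFilter:
      """Probabilistic 'seen URL' set: false positives possible, false negatives never."""

      def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = bytearray(size_bits // 8)

      def _positions(self, url):
          # Derive several bit positions from independent hashes of the URL.
          for i in range(self.num_hashes):
              digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
              yield int.from_bytes(digest[:8], "big") % self.size

      def add(self, url):
          for pos in self._positions(url):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def __contains__(self, url):
          return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))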
Scheduler & Prioritizer
Custom scheduling logic with priority queues
Decides which URLs to crawl next based on priority and politeness
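A sketch of the scheduling idea: a priority heap of URLs combined with a per-domain "not before" timestamp, so the highest-priority URL is passed over while its domain is still within its crawl delay. The class name and default delay are assumptions.

  import heapq
  import time

  class Scheduler:
      """Priority heap of URLs plus a per-domain 'not before' time for politeness."""

      def __init__(self, default_delay=1.0):
          self.heap = []          # entries: (priority, sequence, url, domain); lower = sooner
          self.next_allowed = {}  # domain -> earliest timestamp we may hit it again
          self.default_delay = default_delay
          self._seq = 0           # tie-breaker so equal priorities keep insertion order

      def push(self, url, domain, priority):
          self._seq += 1
          heapq.heappush(self.heap, (priority, self._seq, url, domain))

      def pop_ready(self):
          """Return the best URL whose domain is not rate-limited right now, else None."""
          now = time.time()
          deferred, chosen = [], None
          while self.heap:
              entry = heapq.heappop(self.heap)
              priority, _, url, domain = entry
              if self.next_allowed.get(domain, 0.0) <= now:
                  self.next_allowed[domain] = now + self.default_delay
                  chosen = url
                  break
              deferred.append(entry)  # domain still cooling down; put back afterwards
          for entry in deferred:
              heapq.heappush(self.heap, entry)
          return chosen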
Data Storage
Distributed Storage (e.g., HDFS, S3, NoSQL DB)
Stores crawled page content and metadata for later use
Monitoring & Logging
Prometheus, ELK Stack
Tracks system health, crawl progress, and errors
Request Flow
1. Start with seed URLs loaded into the URL Frontier Manager.
2. Scheduler picks URLs from the frontier respecting priority and politeness.
3. Robots.txt Manager checks if crawling the URL is allowed.
4. Fetcher downloads the page content if allowed.
5. Parser extracts page content and finds new URLs.
6. New URLs are checked against the URL Deduplication Store to avoid repeats.
7. Unique new URLs are added back to the URL Frontier Manager.
8. Fetched page content and metadata are stored in Data Storage.
9. Monitoring tracks progress and errors throughout the process.
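Tying the flow together, a deliberately simplified single-process sketch that reuses the allowed, fetch_with_backoff, and extract_links helpers sketched in the Components section; prioritization, robots.txt crawl delays, durable storage, and monitoring are left out.

  from collections import deque

  def crawl(seed_urls, max_pages=1000):
      """Single-process version of the flow above (steps noted in comments)."""
      frontier = deque(seed_urls)   # step 1: seed the frontier
      seen = set(seed_urls)         # step 6: deduplication store
      pages = {}                    # step 8: stand-in for durable page storage

      while frontier and len(pages) < max_pages:
          url = frontier.popleft()                  # step 2: pick the next URL
          if not allowed(url):                      # step 3: robots.txt check
              continue
          response = fetch_with_backoff(url)        # step 4: fetch with retries
          if response is None or response.status_code != 200:
              continue
          pages[url] = response.text                # step 8: store page content
          for link in extract_links(url, response.text):  # step 5: parse and discover links
              if link not in seen:                  # step 6: skip already-seen URLs
                  seen.add(link)
                  frontier.append(link)             # step 7: enqueue new URLs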
Database Schema
Entities:
- URL: {id (PK), url, status, last_crawled, priority}
- PageContent: {url_id (FK), content, content_type, fetch_time}
- RobotsTxtRules: {domain, rules, fetched_time}
Relationships:
- URL to PageContent is 1:1
- URL to RobotsTxtRules via domain matching (not a direct FK)
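The same entities written out as Python dataclasses, purely as a readable sketch; the field types are assumptions, and the actual store could equally be relational or NoSQL.

  from dataclasses import dataclass
  from datetime import datetime
  from typing import Optional

  @dataclass
  class Url:
      id: int                         # PK
      url: str
      status: str                     # e.g. "pending", "fetched", "failed"
      last_crawled: Optional[datetime]
      priority: int

  @dataclass
  class PageContent:
      url_id: int                     # FK -> Url.id (1:1)
      content: bytes
      content_type: str
      fetch_time: datetime

  @dataclass
  class RobotsTxtRules:
      domain: str                     # matched to Url by domain, not a direct FK
      rules: str                      # raw robots.txt body or a parsed form
      fetched_time: datetime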
Scaling Discussion
Bottlenecks
URL Frontier Manager becomes a bottleneck with millions of URLs
Fetcher limited by network bandwidth and latency
URL Deduplication Store grows large and slow
Data Storage size and write throughput
Scheduler complexity with many URLs and politeness constraints
Solutions
Partition the URL Frontier by domain or URL hash to distribute load (see the sketch after this list)
Use multiple fetcher instances distributed geographically
Use scalable probabilistic data structures like Bloom filters with periodic resets
Use distributed storage systems with sharding and replication
Implement domain-based scheduling to parallelize while respecting politeness
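As an illustration of domain-based partitioning, a sketch of how a URL could be routed to a frontier shard; the shard count is an assumption. Keeping a whole domain on one shard lets that shard's worker own the domain's politeness state without cross-node coordination.

  import hashlib
  from urllib.parse import urlparse

  def frontier_shard(url, num_shards=16):
      """Route a URL to a frontier partition by hashing its domain (not the full URL)."""
      domain = urlparse(url).netloc
      digest = hashlib.md5(domain.encode()).hexdigest()
      return int(digest, 16) % num_shards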
Interview Tips
Time: 10 minutes for requirements and clarifications, 15 minutes for architecture and components, 10 minutes for scaling discussion, 10 minutes for Q&A
Clarify scale and politeness requirements upfront
Explain URL frontier and deduplication importance
Discuss how to respect robots.txt and crawl delays
Describe components and their interactions clearly
Address scaling challenges with partitioning and distribution
Mention failure handling and monitoring for reliability