
Design a web crawler in HLD - System Design Exercise

Design: Web Crawler
The design covers the crawling system, including URL management, fetching, parsing, and storage. Indexing and search functionality are out of scope.
Functional Requirements
FR1: Crawl web pages starting from a list of seed URLs
FR2: Extract and store page content and metadata
FR3: Follow links to discover new pages
FR4: Respect robots.txt rules and crawl delays
FR5: Handle millions of URLs efficiently
FR6: Avoid crawling the same URL multiple times
FR7: Support prioritization of URLs to crawl
FR8: Provide a way to pause and resume crawling
Non-Functional Requirements
NFR1: Scale to crawl at least 10 million pages per day
NFR2: Latency for fetching a page should be under 2 seconds on average
NFR3: System availability should be 99.9%
NFR4: Respect politeness to avoid overloading websites
NFR5: Handle network failures and retries gracefully
Think Before You Design
Questions to Ask
❓ Question 1
❓ Question 2
❓ Question 3
❓ Question 4
❓ Question 5
❓ Question 6
❓ Question 7
Key Components
URL Frontier Manager
Fetcher (HTTP client)
Parser (HTML and link extractor)
Robots.txt Manager
URL Deduplication Store
Data Storage (for pages and metadata)
Scheduler and Prioritizer
Monitoring and Logging
Design Patterns
Producer-Consumer for fetching and parsing
Distributed Queue for URL frontier
Bloom Filters or Hash Sets for deduplication
Rate Limiting for politeness
Retry and Backoff strategies for transient failures (a sketch follows this list)
Sharding for scaling storage
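To make the retry-and-backoff item concrete, here is a minimal Python sketch built on the requests library. The retry count, base delay, and timeout values are illustrative assumptions, not values prescribed by the design.

  import random
  import time

  import requests

  def fetch_with_backoff(url, max_retries=3, base_delay=1.0, timeout=5.0):
      """Fetch a URL, retrying transient failures with exponential backoff plus jitter."""
      for attempt in range(max_retries + 1):
          try:
              response = requests.get(url, timeout=timeout)
              if response.status_code < 500:   # only 5xx is treated as transient here
                  return response
          except requests.RequestException:
              pass                             # network error: fall through to backoff
          if attempt == max_retries:
              break
          # Exponential backoff (1s, 2s, 4s, ...) with jitter to avoid synchronized retries.
          time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
      return None                              # give up; the caller can requeue at lower priority

In practice a fetcher would typically also treat 429 responses and Retry-After headers specially and cap the total time spent per URL.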
Reference Architecture
  +-------------------+       +-------------------+       +-------------------+
  | Seed URLs         |       | URL Frontier      |       | Robots.txt Manager|
  +---------+---------+       +---------+---------+       +---------+---------+
            |                           |                           |
            v                           v                           v
  +-------------------+       +-------------------+       +-------------------+
  | URL Deduplication |<----->| Scheduler &       |<----->| Politeness &      |
  | Store             |       | Prioritizer       |       | Rate Limiter      |
  +---------+---------+       +---------+---------+       +---------+---------+
            |                           |                           |
            v                           v                           v
  +-------------------+       +-------------------+       +-------------------+
  | Fetcher (HTTP)    |<----->| Parser (HTML &    |<----->| Data Storage      |
  +-------------------+       | Link Extractor)   |       | (Pages & Metadata)|
                              +-------------------+       +-------------------+
Components
URL Frontier Manager
Distributed Queue (e.g., Kafka, RabbitMQ)
Stores URLs to be crawled and manages their prioritization
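A minimal in-memory sketch of the frontier idea, with one FIFO queue per domain so that politeness can later be enforced per host; a production frontier would sit on a distributed queue such as Kafka, as noted above. The class and method names are illustrative.

  from collections import deque
  from urllib.parse import urlparse

  class UrlFrontier:
      """Holds pending URLs grouped by domain so each host can be rate-limited independently."""

      def __init__(self):
          self.queues = {}  # domain -> deque of URLs waiting to be crawled

      def add(self, url):
          domain = urlparse(url).netloc
          self.queues.setdefault(domain, deque()).append(url)

      def next_for_domain(self, domain):
          queue = self.queues.get(domain)
          return queue.popleft() if queue else None

      def domains(self):
          """Domains that currently have work queued."""
          return [d for d, q in self.queues.items() if q]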
Fetcher
HTTP Client Library (e.g., libcurl, Requests)
Fetches web pages from the internet respecting politeness
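Because roughly 10 million pages per day works out to over 100 fetches per second, the fetcher is usually run with many concurrent workers. A small sketch using a thread pool with requests; the worker count, timeout, and User-Agent string are assumptions.

  from concurrent.futures import ThreadPoolExecutor

  import requests

  HEADERS = {"User-Agent": "example-crawler/0.1 (+https://example.com/bot)"}  # hypothetical identity

  def fetch(url):
      """Download one page; return (url, status, body), or (url, None, None) on failure."""
      try:
          resp = requests.get(url, headers=HEADERS, timeout=5.0)
          return url, resp.status_code, resp.text
      except requests.RequestException:
          return url, None, None

  def fetch_batch(urls, workers=16):
      """Fetch a batch of URLs concurrently; network I/O dominates, so threads scale well here."""
      with ThreadPoolExecutor(max_workers=workers) as pool:
          return list(pool.map(fetch, urls))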
Parser
HTML Parser (e.g., BeautifulSoup, jsoup)
Extracts page content and discovers new URLs from fetched pages
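A sketch of link extraction with BeautifulSoup, as suggested above: it resolves relative links against the page URL, strips fragments, and keeps only HTTP(S) URLs so the deduplication store sees one canonical form per link.

  from urllib.parse import urljoin, urldefrag

  from bs4 import BeautifulSoup

  def extract_links(base_url, html):
      """Return absolute, fragment-free HTTP(S) links found in <a href> attributes."""
      soup = BeautifulSoup(html, "html.parser")
      links = set()
      for anchor in soup.find_all("a", href=True):
          absolute = urljoin(base_url, anchor["href"])   # resolve relative links
          absolute, _fragment = urldefrag(absolute)      # drop #fragments before dedup
          if absolute.startswith(("http://", "https://")):
              links.add(absolute)
      return links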
Robots.txt Manager
Custom or existing robots.txt parser
Checks and enforces crawling rules per website
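Python's standard library already ships a robots.txt parser, so a sketch of this component can stay small; the crawler's user-agent string and the per-domain cache are assumptions.

  from urllib import robotparser
  from urllib.parse import urlparse

  USER_AGENT = "example-crawler"  # hypothetical crawler name
  _parsers = {}                   # domain -> RobotFileParser, cached so robots.txt is fetched once

  def allowed(url):
      """Return True if robots.txt for the URL's domain permits fetching it."""
      parts = urlparse(url)
      domain = parts.netloc
      if domain not in _parsers:
          rp = robotparser.RobotFileParser()
          rp.set_url(f"{parts.scheme}://{domain}/robots.txt")
          try:
              rp.read()           # fetches and parses robots.txt for this domain
          except OSError:
              pass                # unreachable robots.txt: the parser defaults to allowing
          _parsers[domain] = rp
      return _parsers[domain].can_fetch(USER_AGENT, url)

RobotFileParser also exposes crawl_delay(), which can feed the per-domain delay used by the scheduler.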
URL Deduplication Store
Bloom Filter or Distributed Hash Set (e.g., Redis, Cassandra)
Prevents crawling the same URL multiple times
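A toy Bloom filter to illustrate the deduplication trade-off: it answers "definitely not seen" or "probably seen" in constant memory, at the cost of occasional false positives (a URL wrongly skipped). The filter size and hash count are arbitrary here. In a distributed setup the same check is often a Redis SADD, which reports whether the member was new.

  import hashlib

  class BloomFilter:
      """Probabilistic 'seen URL' set: false positives possible, false negatives never."""

      def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=5):
          self.size = size_bits
          self.num_hashes = num_hashes
          self.bits = bytearray(size_bits // 8)

      def _positions(self, url):
          # Derive several bit positions from independent hashes of the URL.
          for i in range(self.num_hashes):
              digest = hashlib.sha256(f"{i}:{url}".encode()).digest()
              yield int.from_bytes(digest[:8], "big") % self.size

      def add(self, url):
          for pos in self._positions(url):
              self.bits[pos // 8] |= 1 << (pos % 8)

      def __contains__(self, url):
          return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))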
Scheduler & Prioritizer
Custom scheduling logic with priority queues
Decides which URLs to crawl next based on priority and politeness
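A sketch of the scheduling idea: a priority heap of URLs combined with a per-domain "not before" timestamp, so the highest-priority URL is passed over while its domain is still within its crawl delay. The class name and default delay are assumptions.

  import heapq
  import time

  class Scheduler:
      """Priority heap of URLs plus a per-domain 'not before' time for politeness."""

      def __init__(self, default_delay=1.0):
          self.heap = []          # entries: (priority, sequence, url, domain); lower = sooner
          self.next_allowed = {}  # domain -> earliest timestamp we may hit it again
          self.default_delay = default_delay
          self._seq = 0           # tie-breaker so equal priorities keep insertion order

      def push(self, url, domain, priority):
          self._seq += 1
          heapq.heappush(self.heap, (priority, self._seq, url, domain))

      def pop_ready(self):
          """Return the best URL whose domain is not rate-limited right now, else None."""
          now = time.time()
          deferred, chosen = [], None
          while self.heap:
              entry = heapq.heappop(self.heap)
              priority, _, url, domain = entry
              if self.next_allowed.get(domain, 0.0) <= now:
                  self.next_allowed[domain] = now + self.default_delay
                  chosen = url
                  break
              deferred.append(entry)  # domain still cooling down; put back afterwards
          for entry in deferred:
              heapq.heappush(self.heap, entry)
          return chosen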
Data Storage
Distributed Storage (e.g., HDFS, S3, NoSQL DB)
Stores crawled page content and metadata for later use
Monitoring & Logging
Prometheus, ELK Stack
Tracks system health, crawl progress, and errors
Request Flow
1. Start with seed URLs loaded into the URL Frontier Manager.
2. Scheduler picks URLs from the frontier respecting priority and politeness.
3. Robots.txt Manager checks if crawling the URL is allowed.
4. Fetcher downloads the page content if allowed.
5. Parser extracts page content and finds new URLs.
6. New URLs are checked against the URL Deduplication Store to avoid repeats.
7. Unique new URLs are added back to the URL Frontier Manager.
8. Fetched page content and metadata are stored in Data Storage.
9. Monitoring tracks progress and errors throughout the process.
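Tying the flow together, a deliberately simplified single-process sketch that reuses the allowed, fetch_with_backoff, and extract_links helpers sketched in the Components section; prioritization, robots.txt crawl delays, durable storage, and monitoring are left out.

  from collections import deque

  def crawl(seed_urls, max_pages=1000):
      """Single-process version of the flow above (steps noted in comments)."""
      frontier = deque(seed_urls)   # step 1: seed the frontier
      seen = set(seed_urls)         # step 6: deduplication store
      pages = {}                    # step 8: stand-in for durable page storage

      while frontier and len(pages) < max_pages:
          url = frontier.popleft()                  # step 2: pick the next URL
          if not allowed(url):                      # step 3: robots.txt check
              continue
          response = fetch_with_backoff(url)        # step 4: fetch with retries
          if response is None or response.status_code != 200:
              continue
          pages[url] = response.text                # step 8: store page content
          for link in extract_links(url, response.text):  # step 5: parse and discover links
              if link not in seen:                  # step 6: skip already-seen URLs
                  seen.add(link)
                  frontier.append(link)             # step 7: enqueue new URLs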
Database Schema
Entities:
- URL: {id (PK), url, status, last_crawled, priority}
- PageContent: {url_id (FK), content, content_type, fetch_time}
- RobotsTxtRules: {domain, rules, fetched_time}
Relationships:
- URL to PageContent is 1:1
- URL to RobotsTxtRules via domain matching (not a direct FK)
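The same entities written out as Python dataclasses, purely as a readable sketch; the field types are assumptions, and the actual store could equally be relational or NoSQL.

  from dataclasses import dataclass
  from datetime import datetime
  from typing import Optional

  @dataclass
  class Url:
      id: int                         # PK
      url: str
      status: str                     # e.g. "pending", "fetched", "failed"
      last_crawled: Optional[datetime]
      priority: int

  @dataclass
  class PageContent:
      url_id: int                     # FK -> Url.id (1:1)
      content: bytes
      content_type: str
      fetch_time: datetime

  @dataclass
  class RobotsTxtRules:
      domain: str                     # matched to Url by domain, not a direct FK
      rules: str                      # raw robots.txt body or a parsed form
      fetched_time: datetime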
Scaling Discussion
Bottlenecks
URL Frontier Manager becomes a bottleneck with millions of URLs
Fetcher limited by network bandwidth and latency
URL Deduplication Store grows large and slow
Data Storage size and write throughput
Scheduler complexity with many URLs and politeness constraints
Solutions
Partition the URL Frontier by domain or URL hash to distribute load (see the sketch after this list)
Use multiple fetcher instances distributed geographically
Use scalable probabilistic data structures like Bloom filters with periodic resets
Use distributed storage systems with sharding and replication
Implement domain-based scheduling to parallelize while respecting politeness
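As an illustration of domain-based partitioning, a sketch of how a URL could be routed to a frontier shard; the shard count is an assumption. Keeping a whole domain on one shard lets that shard's worker own the domain's politeness state without cross-node coordination.

  import hashlib
  from urllib.parse import urlparse

  def frontier_shard(url, num_shards=16):
      """Route a URL to a frontier partition by hashing its domain (not the full URL)."""
      domain = urlparse(url).netloc
      digest = hashlib.md5(domain.encode()).hexdigest()
      return int(digest, 16) % num_shards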
Interview Tips
Time: 10 minutes for requirements and clarifications, 15 minutes for architecture and components, 10 minutes for scaling discussion, 10 minutes for Q&A
Clarify scale and politeness requirements upfront
Explain URL frontier and deduplication importance
Discuss how to respect robots.txt and crawl delays
Describe components and their interactions clearly
Address scaling challenges with partitioning and distribution
Mention failure handling and monitoring for reliability