System Overview - Design a web crawler
A web crawler automatically browses the internet to collect and index web pages. It must efficiently fetch pages, avoid duplicate visits, and handle large-scale data with fault tolerance.
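Avoiding duplicate visits usually starts with URL canonicalization: trivially different spellings of the same address should map to one key before the dedup check. The sketch below is a minimal, hypothetical illustration using Python's standard library; the normalization rules (lowercasing the host, dropping fragments and trailing slashes) are illustrative assumptions, and production crawlers typically apply many more.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Canonicalize a URL so trivially different forms dedupe to one key."""
    parts = urlsplit(url)
    # Illustrative rules: lowercase scheme/host, drop fragment, trim trailing slash.
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

seen: set[str] = set()   # in-memory stand-in for a shared dedup store

def should_visit(url: str) -> bool:
    """Return True only the first time a (normalized) URL is offered."""
    key = normalize(url)
    if key in seen:
        return False
    seen.add(key)
    return True

print(should_visit("https://Example.com/a/"))  # first visit -> True
print(should_visit("https://example.com/a"))   # same page after normalization -> False
```

At scale, the in-memory set would be replaced by a Bloom filter or a distributed key-value store, trading exactness for memory.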
User
  |
  v
Load Balancer
  |
  v
API Gateway
  |
  v
Scheduler ---> URL Frontier (Queue) ---> Fetcher Pool ---> Parser ---> Storage (Database)
    |                                         |
    v                                         v
  Cache                                Message Queue
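The Scheduler, Frontier, Fetcher Pool, and Parser stages above amount to a breadth-first traversal of the link graph. The following single-process sketch makes that loop concrete; the `fetch` stub and `LINK_GRAPH` are hypothetical stand-ins (a real fetcher issues HTTP requests and a real parser extracts links from HTML), and `stored` stands in for the Storage component.

```python
from collections import deque

# Hypothetical fixed link graph standing in for the live web.
LINK_GRAPH = {
    "https://example.com/":  ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": [],
}

def fetch(url: str) -> list[str]:
    """Stub fetcher/parser: return a page's outgoing links instead of HTML."""
    return LINK_GRAPH.get(url, [])

def crawl(seed: str) -> list[str]:
    frontier = deque([seed])   # URL Frontier (FIFO queue)
    seen = {seed}              # dedup set so each URL is fetched once
    stored = []                # stand-in for the Storage component
    while frontier:
        url = frontier.popleft()       # Scheduler dispatches next URL
        stored.append(url)             # "index" the fetched page
        for link in fetch(url):        # Parser yields outlinks
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return stored

print(crawl("https://example.com/"))
# -> ['https://example.com/', 'https://example.com/a', 'https://example.com/b']
```

In the distributed design, the deque becomes the message-queue-backed Frontier and many fetcher workers consume from it concurrently, but the visit-once invariant enforced by `seen` is the same.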