Challenge - 5 Problems
Web Crawler Mastery
Get all challenges correct to earn this badge!
Test your skills under time pressure!
❓ Architecture
Intermediate
Identify the main components of a web crawler architecture
Which of the following lists correctly represents the essential components of a scalable web crawler system?
💡 Hint
Think about the components that handle URL management, downloading pages, and storing data.
✅ Explanation
A web crawler needs to manage URLs to visit (URL Frontier), download pages (Fetcher), extract links and content (Parser), filter URLs to avoid duplicates or unwanted sites (URL Filter), and store the data (Storage).
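To make the data flow concrete, below is a minimal single-process sketch of how these five components could hand work to each other. The class names, the stubbed fetcher, and the naive link extraction are illustrative assumptions, not a production design.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory components; a real crawler would back these with
# distributed queues and durable storage.

class URLFrontier:
    """Holds URLs waiting to be crawled."""
    def __init__(self, seeds):
        self.queue = deque(seeds)

    def next_url(self):
        return self.queue.popleft() if self.queue else None

    def add(self, url):
        self.queue.append(url)

class URLFilter:
    """Drops already-seen URLs and unwanted schemes."""
    def __init__(self):
        self.seen = set()

    def allow(self, url):
        if url in self.seen or urlparse(url).scheme not in ("http", "https"):
            return False
        self.seen.add(url)
        return True

def fetch(url):
    """Fetcher: download the page (stubbed here for illustration)."""
    return "<html><a href='/about'>About</a></html>"

def parse(base_url, html):
    """Parser: extract outgoing links (naive string scan, illustration only)."""
    return [urljoin(base_url, part.split("'")[0]) for part in html.split("href='")[1:]]

def crawl(seeds, storage, limit=10):
    frontier, url_filter = URLFrontier(seeds), URLFilter()
    for seed in seeds:
        url_filter.allow(seed)
    while limit > 0:
        url = frontier.next_url()
        if url is None:
            break
        html = fetch(url)
        storage[url] = html            # Storage: persist raw content
        for link in parse(url, html):  # feed new links back into the frontier
            if url_filter.allow(link):
                frontier.add(link)
        limit -= 1

store = {}
crawl(["https://example.com"], store)
print(list(store))
```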
❓ Scaling
Intermediate
Scaling the URL Frontier in a distributed crawler
What is the best approach to scale the URL Frontier component to handle billions of URLs efficiently?
💡 Hint
Consider how to avoid bottlenecks and balance load across servers.
✅ Explanation
Partitioning URLs by domain and distributing queues allows parallel processing and avoids a single bottleneck, making the system scalable.
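A small sketch of the idea, assuming a fixed number of frontier shards and SHA-1 hashing of the domain; a real deployment would more likely use consistent hashing so shards can be added without remapping every domain.

```python
import hashlib
from collections import defaultdict, deque
from urllib.parse import urlparse

NUM_SHARDS = 4  # assumed fixed shard count for illustration

def shard_for(url):
    """Map a URL's domain to a frontier shard (same domain -> same shard)."""
    domain = urlparse(url).netloc
    digest = hashlib.sha1(domain.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shards = defaultdict(deque)
for url in [
    "https://example.com/a",
    "https://example.com/b",
    "https://news.example.org/top",
]:
    shards[shard_for(url)].append(url)

for shard_id, queue in sorted(shards.items()):
    print(shard_id, list(queue))
```

Keeping all URLs of one domain on one shard also simplifies politeness, since only a single worker ever contacts that site.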
❓ Tradeoff
Advanced
Tradeoffs in politeness and crawling speed
Which option best describes the tradeoff between politeness (respecting website rules) and crawling speed in a web crawler?
💡 Hint
Think about how respecting website rules affects how fast you can crawl.
✅ Explanation
Respecting politeness means adding delays between requests to the same site, which reduces crawling speed but avoids overloading servers or getting banned.
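One way this shows up in code is a per-domain rate limiter. The sketch below assumes a fixed 1-second crawl delay for every host; an actual crawler would read each site's robots.txt (for example, its Crawl-delay directive) instead of hard-coding the value.

```python
import time
from urllib.parse import urlparse

CRAWL_DELAY_SECONDS = 1.0  # assumed delay; normally taken from robots.txt
_last_request = {}         # domain -> timestamp of the last request

def wait_for_slot(url):
    """Sleep just long enough to respect the per-domain crawl delay."""
    domain = urlparse(url).netloc
    now = time.monotonic()
    earliest = _last_request.get(domain, 0.0) + CRAWL_DELAY_SECONDS
    if now < earliest:
        time.sleep(earliest - now)
    _last_request[domain] = time.monotonic()

for url in ["https://example.com/1", "https://example.com/2", "https://other.org/x"]:
    wait_for_slot(url)
    print("fetching", url)  # requests to example.com are spaced ~1s apart
```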
🧠 Conceptual
Advanced
Handling duplicate content in web crawling
What is the most effective method to detect and avoid storing duplicate pages in a web crawler system?
💡 Hint
Think about a fast way to check if content is the same without storing everything twice.
✅ Explanation
Using hash functions like MD5 or SHA on page content allows quick comparison to detect duplicates without storing full content multiple times.
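A minimal sketch of exact-duplicate detection with a content hash (SHA-256 here); near-duplicates that differ only in boilerplate would need fuzzier techniques such as SimHash, which this example does not cover.

```python
import hashlib

seen_hashes = set()  # in production this would be a shared, persistent store

def is_duplicate(content: bytes) -> bool:
    """Return True if identical content was already stored."""
    digest = hashlib.sha256(content).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

pages = [b"<html>hello</html>", b"<html>hello</html>", b"<html>world</html>"]
for page in pages:
    print(is_duplicate(page))  # False, True, False
```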
❓ Estimation
Expert
Estimating storage needs for a large-scale web crawler
If a web crawler downloads 1 billion pages per year, each averaging 500 KB, what is the approximate storage needed per year (in terabytes) to store raw pages without compression?
💡 Hint
Calculate total bytes: pages × size, then convert to terabytes (1 TB = 10^12 bytes).
✅ Explanation
1 billion pages × 500 KB = 10^9 × 500 × 10^3 bytes = 5 × 10^14 bytes. 5 × 10^14 / 10^12 = 500 TB.
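The same arithmetic as a quick script, using the question's assumptions (1 KB = 10^3 bytes, 1 TB = 10^12 bytes, no compression):

```python
pages_per_year = 1_000_000_000     # 1 billion pages
avg_page_size_bytes = 500 * 10**3  # 500 KB per page

total_bytes = pages_per_year * avg_page_size_bytes  # 5 x 10^14 bytes
total_tb = total_bytes / 10**12                     # convert to terabytes
print(f"{total_tb:.0f} TB per year")                # 500 TB
```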
