Recall & Review
beginner
What is the main purpose of a web crawler?
A web crawler automatically browses the internet to collect and index web pages for search engines or data analysis.
beginner
What is a URL frontier in a web crawler?
The URL frontier is a queue or list that stores URLs to be visited next by the crawler.
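The frontier described above can be sketched as a simple FIFO queue; the `URLFrontier` class below is illustrative, not from any particular crawler:

```python
from collections import deque

class URLFrontier:
    """Minimal FIFO frontier: holds URLs waiting to be crawled."""
    def __init__(self, seeds=()):
        self._queue = deque(seeds)
        self._enqueued = set(seeds)  # avoid queuing the same URL twice

    def add(self, url):
        if url not in self._enqueued:
            self._enqueued.add(url)
            self._queue.append(url)

    def next_url(self):
        """Return the next URL to visit, or None if the frontier is empty."""
        return self._queue.popleft() if self._queue else None

frontier = URLFrontier(["https://example.com/"])
frontier.add("https://example.com/about")
frontier.add("https://example.com/")  # duplicate, silently ignored
```

Real crawlers replace the single queue with prioritized, per-host queues, but the store-and-pop behavior is the same.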
intermediate
Why is politeness important in web crawling?
Politeness means respecting website rules like robots.txt and limiting request rates to avoid overloading servers.
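One way to enforce the rate-limiting half of politeness is to remember the last request time per host. The `HostRateLimiter` below is a hypothetical sketch; it takes timestamps as arguments (rather than sleeping) so the logic is easy to follow:

```python
class HostRateLimiter:
    """Tracks the last request time per host and reports how long the
    crawler must wait before hitting that host again."""
    def __init__(self, min_delay=1.0):
        self.min_delay = min_delay  # seconds between requests to one host
        self._last = {}             # host -> timestamp of last request

    def delay_for(self, host, now):
        """Seconds to wait before requesting from `host` at time `now`."""
        last = self._last.get(host)
        return 0.0 if last is None else max(0.0, self.min_delay - (now - last))

    def record(self, host, now):
        self._last[host] = now

limiter = HostRateLimiter(min_delay=2.0)
limiter.record("example.com", now=100.0)
print(limiter.delay_for("example.com", now=100.5))  # 1.5
print(limiter.delay_for("other.com", now=100.5))    # 0.0
```

In a running crawler the caller would sleep for the returned delay (or reschedule the URL) before fetching.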
intermediate
What role does the parser play in a web crawler?
The parser extracts useful information and new URLs from the downloaded web pages for further crawling.
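A minimal link-extracting parser can be built on Python's standard-library `html.parser`; the `LinkExtractor` name and structure below are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href=...> tags in an HTML page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url  # used to resolve relative links
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

html = '<a href="/about">About</a><a href="https://other.org/">Other</a>'
parser = LinkExtractor("https://example.com/index.html")
parser.feed(html)
print(parser.links)
```

The extracted links would then be handed to the URL frontier for future crawling.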
intermediate
How can a web crawler avoid visiting the same page multiple times?
By maintaining a visited URL set or database to track and skip URLs that have already been crawled.
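A sketch of such a visited set, with light URL normalization so trivially different spellings of the same page deduplicate together (the `normalize` and `should_crawl` helpers here are hypothetical):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lowercase scheme/host and drop the fragment, so near-identical
    URLs map to the same key."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))

visited = set()

def should_crawl(url):
    """Return True (and remember the URL) only the first time it is seen."""
    key = normalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True

print(should_crawl("https://Example.com/page#top"))  # True, first visit
print(should_crawl("https://example.com/page"))      # False, same page
```

At web scale the set is usually replaced by a database or a Bloom filter, but the check-then-record pattern is unchanged.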
What component stores URLs waiting to be crawled?
The URL frontier holds URLs that the crawler will visit next.
Which file tells a crawler which pages it should not visit?
robots.txt instructs crawlers about allowed and disallowed pages.
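Python's standard library ships a robots.txt parser, `urllib.robotparser`. The snippet below feeds it rules directly instead of fetching a file, so it stays self-contained; the rules and user-agent name are made up for illustration:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# A polite crawler checks can_fetch() before every download.
print(rp.can_fetch("MyCrawler", "https://example.com/public/page"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/page"))  # False
```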
What is the main reason to limit the crawl rate?
Limiting crawl rate prevents harming website performance.
Which component extracts links from a web page?
The parser reads the page content and finds new URLs.
How does a crawler avoid visiting duplicate pages?
Tracking visited URLs helps skip duplicates.
Explain the main components of a web crawler and their roles.
Think about how URLs are stored, fetched, processed, and tracked.
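The components above can be wired into a toy crawl loop. This is a sketch under simplifying assumptions: `fetch` is a stand-in for real HTTP downloads, the regex link extraction is deliberately crude, and a production crawler would add politeness checks and a proper HTML parser:

```python
from collections import deque
from urllib.parse import urljoin
import re

def crawl(seed, fetch, max_pages=10):
    """Minimal single-threaded crawl loop: frontier + fetch + parse + dedupe.
    `fetch(url)` is assumed to return the page's HTML as a string."""
    frontier = deque([seed])  # URLs waiting to be crawled (the URL frontier)
    visited = set()           # URLs already crawled
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue          # skip duplicates
        visited.add(url)
        html = fetch(url)     # download step; politeness checks would go here
        pages[url] = html
        for href in re.findall(r'href="([^"]+)"', html):
            frontier.append(urljoin(url, href))  # feed new links back in
    return pages

# Tiny in-memory "web" standing in for real HTTP fetches.
site = {
    "https://example.com/":  '<a href="/a">A</a><a href="/b">B</a>',
    "https://example.com/a": '<a href="/">home</a>',
    "https://example.com/b": '',
}
pages = crawl("https://example.com/", fetch=site.__getitem__)
print(sorted(pages))  # every page reached exactly once despite the cycle
```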
Describe how a web crawler respects website rules and avoids overloading servers.
Consider both technical and ethical aspects.
