HLD · system_design · ~5 mins

Design a Web Crawler (HLD): Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the main purpose of a web crawler?
A web crawler automatically browses the internet to collect and index web pages for search engines or data analysis.
beginner
What is a URL frontier in a web crawler?
The URL frontier is a queue or list that stores URLs to be visited next by the crawler.
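For a concrete picture, here is a minimal Python sketch of a URL frontier as a FIFO queue paired with a set, so each URL is scheduled at most once. The class and method names are illustrative, not a prescribed API.

```python
# Minimal URL frontier sketch: a FIFO queue of URLs waiting to be crawled,
# plus a set so each URL is enqueued at most once. Names are illustrative.
from collections import deque

class URLFrontier:
    def __init__(self, seed_urls):
        self._queue = deque()
        self._enqueued = set()
        for url in seed_urls:
            self.add(url)

    def add(self, url):
        # Only schedule URLs that have not been seen before.
        if url not in self._enqueued:
            self._enqueued.add(url)
            self._queue.append(url)

    def next_url(self):
        # Next URL to fetch, or None when the frontier is empty.
        return self._queue.popleft() if self._queue else None

frontier = URLFrontier(["https://example.com/"])
print(frontier.next_url())  # https://example.com/
```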
intermediate
Why is politeness important in web crawling?
Politeness means respecting website rules like robots.txt and limiting request rates to avoid overloading servers.
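As a rough sketch of politeness (one possible approach, not the only one), the standard library's urllib.robotparser can check robots.txt and a per-host timestamp map can enforce a minimum delay between requests. The one-second delay and the "MyCrawler" user agent below are assumptions chosen for illustration.

```python
# Politeness sketch: honor robots.txt and rate-limit requests per host.
# The 1-second delay and "MyCrawler" user agent are illustrative assumptions.
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

CRAWL_DELAY_SECONDS = 1.0   # assumed minimum gap between requests to one host
_last_fetch = {}            # host -> time of the most recent request
_robots = {}                # host -> parsed robots.txt rules

def is_allowed(url, user_agent="MyCrawler"):
    host = urlparse(url).netloc
    if host not in _robots:
        parser = RobotFileParser(f"https://{host}/robots.txt")
        parser.read()        # fetches robots.txt; may raise if the host is down
        _robots[host] = parser
    return _robots[host].can_fetch(user_agent, url)

def wait_politely(url):
    host = urlparse(url).netloc
    gap = time.time() - _last_fetch.get(host, 0.0)
    if gap < CRAWL_DELAY_SECONDS:
        time.sleep(CRAWL_DELAY_SECONDS - gap)
    _last_fetch[host] = time.time()
```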
intermediate
What role does the parser play in a web crawler?
The parser extracts useful information and new URLs from the downloaded web pages for further crawling.
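To make the parsing step concrete, here is a small sketch that uses only the standard library's HTMLParser to pull href links out of a downloaded page and resolve them to absolute URLs. The LinkExtractor name is purely illustrative.

```python
# Parser sketch: extract <a href="..."> links from HTML and resolve them
# against the page URL, using only the standard library.
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect every <a href="..."> and turn it into an absolute URL.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

extractor = LinkExtractor("https://example.com/")
extractor.feed('<a href="/about">About</a> <a href="https://other.org/">Other</a>')
print(extractor.links)  # ['https://example.com/about', 'https://other.org/']
```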
intermediate
How can a web crawler avoid visiting the same page multiple times?
By maintaining a visited URL set or database to track and skip URLs that have already been crawled.
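One way to implement this, sketched below with assumed helper names, is to normalize each URL (for example, drop the #fragment) and keep a set of hashes of URLs already crawled, checked before every fetch.

```python
# Dedup sketch: normalize URLs and track hashes of pages already crawled,
# so revisits are skipped cheaply. Helper names are illustrative.
import hashlib
from urllib.parse import urldefrag

visited = set()

def normalize(url):
    # Drop fragments (#section) so they do not look like new URLs.
    return urldefrag(url)[0].rstrip("/")

def should_crawl(url):
    key = hashlib.sha256(normalize(url).encode()).hexdigest()
    if key in visited:
        return False
    visited.add(key)
    return True

print(should_crawl("https://example.com/page#top"))  # True
print(should_crawl("https://example.com/page"))      # False (already seen)
```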
What component stores URLs waiting to be crawled?
A. Parser
B. URL frontier
C. Downloader
D. Indexer
Which file tells a crawler which pages it should not visit?
A. index.html
B. sitemap.xml
C. robots.txt
D. config.json
What is the main reason to limit the crawl rate?
A. To avoid overloading web servers
B. To speed up crawling
C. To reduce storage needs
D. To increase bandwidth usage
Which component extracts links from a web page?
A. Parser
B. Downloader
C. Scheduler
D. Cache
How does a crawler avoid visiting duplicate pages?
A. By downloading pages twice
B. By increasing crawl speed
C. By ignoring robots.txt
D. By tracking visited URLs
Explain the main components of a web crawler and their roles.
Think about how URLs are stored, fetched, processed, and tracked.
Describe how a web crawler respects website rules and avoids overloading servers.
Consider both technical and ethical aspects.
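As a starting point for these answers, a single-threaded toy crawl loop can tie the pieces together: a frontier, a downloader, a simple regex-based parser, and a visited set. This is only a sketch under simplifying assumptions (no politeness checks, no retries, no persistent storage), not a production design.

```python
# Toy crawl loop tying the components together: frontier, downloader,
# parser, and visited set. Simplified on purpose; real crawlers add
# politeness, error handling, distribution, and storage.
import re
import urllib.request
from urllib.parse import urljoin

def crawl(seed_url, max_pages=10):
    frontier = [seed_url]              # URL frontier: URLs waiting to be fetched
    visited = set()                    # URLs already crawled
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)
        if url in visited:
            continue                   # dedup: skip already-crawled pages
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                   # skip unreachable or non-decodable pages
        visited.add(url)
        # Parser step: pull href values and resolve them against the page URL.
        for link in re.findall(r'href="([^"]+)"', html):
            frontier.append(urljoin(url, link))
    return visited
```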