Imagine a search engine as a librarian who needs to find books in a huge library. Which step below best describes how the search engine finds web pages?
Think about how a librarian collects books to organize them.
Search engines use crawlers (also called spiders) to visit and read web pages. This is like a librarian walking through the library to find and catalog books.
Look at the simplified flowchart below showing a crawler visiting a web page:
Start -> Visit URL -> Read page content -> Extract links -> Store data -> End
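The flowchart above can be sketched as a small Python function. This is a minimal sketch, not a real crawler: `fetch` is a hypothetical stand-in for an HTTP request, and the page contents and URLs are made up.

```python
# Minimal sketch of the crawl loop from the flowchart above.
# fetch() is a hypothetical stand-in for a real HTTP request.

def fetch(url):
    # Fake "web": maps a URL to (page content, outgoing links).
    fake_web = {
        "http://example.com": ("Welcome page", ["http://example.com/about"]),
        "http://example.com/about": ("About page", []),
    }
    return fake_web.get(url, ("", []))

def crawl_once(url, store):
    content, links = fetch(url)  # Visit URL -> Read page content
    store[url] = content         # Store data
    return links                 # Extract links (to visit later)

store = {}
links = crawl_once("http://example.com", store)
print(store)  # the content the crawler stored
print(links)  # the links it found for future visits
```

Each step of the flowchart maps to one line: fetch the page, store what was read, and hand back the extracted links.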
What information does the crawler collect to help the search engine?
Think about what helps the search engine find and connect pages.
The crawler reads the text to understand the page's content and extracts links to find more pages to visit.
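Both kinds of information can be pulled from a page with Python's standard-library HTML parser. The HTML snippet below is a made-up example page, not a real one.

```python
# Sketch of what a crawler collects from one page:
# the visible text (for indexing) and the outgoing links (for further crawling).
from html.parser import HTMLParser

class LinkAndTextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []       # href values found in <a> tags
        self.text_parts = []  # visible text fragments

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

html = '<p>Search engines crawl the web.</p><a href="/next-page">Next</a>'
collector = LinkAndTextCollector()
collector.feed(html)
print(collector.text_parts)  # content the engine can index
print(collector.links)       # pages the crawler will visit next
```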
Which statement correctly compares the roles of crawling and indexing in search engines?
Think of crawling as collecting books and indexing as making a catalog.
Crawling is the process of visiting and reading pages. Indexing is storing and organizing the collected information so the search engine can quickly find relevant pages.
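The division of labor can be shown with a toy inverted index. The page names and texts below are invented; real indexes also handle stemming, ranking, and much larger scale.

```python
# Sketch: crawling has already collected page text; indexing organizes it
# into an inverted index (word -> set of pages containing that word).
crawled_pages = {
    "page_a": "search engines crawl the web",
    "page_b": "a librarian catalogs books",
}

index = {}
for page, text in crawled_pages.items():
    for word in text.split():
        index.setdefault(word, set()).add(page)

# A query is now a fast lookup instead of re-reading every page.
print(sorted(index["crawl"]))      # ['page_a']
print(sorted(index["librarian"]))  # ['page_b']
```

This is why indexing matters: without it, answering a query would mean re-reading every crawled page.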
A crawler visits pages but never follows links to new pages. What problem will this cause?
Think about how the crawler discovers new pages.
If the crawler does not follow links, it cannot discover new pages beyond the starting points, so the search engine's index will be incomplete.
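The effect can be demonstrated directly. In the sketch below the link graph is hypothetical; the same crawler is run once following links and once ignoring them.

```python
# Sketch: the same crawler with link-following on vs. off.
# The link graph (which page links to which) is made up.
pages = {"A": ["B", "C"], "B": ["D"], "C": [], "D": []}

def crawl(start, follow_links=True):
    crawled, frontier = set(), [start]
    while frontier:
        page = frontier.pop()
        if page in crawled:
            continue
        crawled.add(page)
        if follow_links:
            frontier.extend(pages.get(page, []))
    return crawled

print(sorted(crawl("A", follow_links=True)))   # ['A', 'B', 'C', 'D']
print(sorted(crawl("A", follow_links=False)))  # ['A'] -- the index misses B, C, D
```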
Given this simplified crawler code simulation:
pages = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
crawled = set()

def crawl(page):
    if page not in crawled:
        crawled.add(page)
        for link in pages.get(page, []):
            crawl(link)

crawl("A")
print(sorted(crawled))

What will be printed?
Trace the calls and see which pages get added to the set.
The crawler starts at "A" and adds it, then recursively visits "B". "B" links to "C", so "C" is added next, and "C" links back to "A", which is already in the set. When "A"'s loop then reaches "C" directly, it is skipped because "C" is already crawled. All three pages end up in the set, so the output is ['A', 'B', 'C'].
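The trace is easiest to see by adding a print statement inside the recursive call (the only change from the original code):

```python
# The same crawler with a print added, to make the visit order visible.
pages = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
crawled = set()

def crawl(page):
    if page not in crawled:
        print("visiting", page)  # added for the trace
        crawled.add(page)
        for link in pages.get(page, []):
            crawl(link)

crawl("A")
print(sorted(crawled))  # ['A', 'B', 'C']
```

Running it prints "visiting A", "visiting B", "visiting C": "C" is reached through "B" before "A"'s own link to "C" is tried, and the repeat visits are skipped by the `if page not in crawled` check.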