Introduction
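Website owners often want to control which parts of a site search engine crawlers visit and which parts they skip. Without clear instructions, crawlers might fetch and index pages that were meant to stay out of search results, or waste time on unimportant pages. A robots.txt file at the site root gives crawlers those instructions. It is a request that well-behaved crawlers follow, not an access control, so it does not make a page truly private.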
Imagine a library where some bookshelves are open to all visitors, but others are behind locked doors. The librarian puts up signs telling visitors which shelves they can browse and which are off-limits. Sometimes, a special book inside a locked shelf is allowed for viewing with permission.
┌─────────────────────────────┐
│ Website Root                │
│ (example.com/robots.txt)    │
├─────────────────────────────┤
│ User-agent: *               │
│ Disallow: /private/         │
│ Allow: /private/public.html │
└─────────────────────────────┘
               ↓
┌─────────────────────────────┐
│ Search Engine Robots Read   │
│ robots.txt and Follow Rules │
└─────────────────────────────┘
Use User-agent: * to target all robots, and Disallow: / to block the entire site.
Given the robots.txt content below, which URL will be blocked from crawling?
    User-agent: Googlebot
    Disallow: /private/

    User-agent: *
    Disallow: /temp/
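One way to sanity-check the question above is to feed the same rules to Python's standard-library urllib.robotparser. This is only a sketch: it shows how that one parser reads the file, not how any particular search engine evaluates it. The crawler name OtherBot and the example.com URLs are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The rules from the question above, one directive per line.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /temp/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs and a hypothetical second crawler name (illustration only).
checks = [
    ("Googlebot", "https://example.com/private/report.html"),
    ("Googlebot", "https://example.com/temp/cache.html"),
    ("OtherBot", "https://example.com/private/report.html"),
    ("OtherBot", "https://example.com/temp/cache.html"),
]

# Map each (agent, url) pair to True (may crawl) or False (blocked).
results = {(agent, url): parser.can_fetch(agent, url) for agent, url in checks}

for (agent, url), allowed in results.items():
    print(f"{agent:10} {url} -> {'allowed' if allowed else 'blocked'}")
```

Googlebot matches its own group, so only /private/ is blocked for it and the /temp/ rule in the * group does not apply. Any other crawler falls back to the * group, where only /temp/ is blocked.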
Consider this robots.txt snippet:
    User-agent: *
    Disallow /admin/
Disallow is missing a colon; the corrected line is Disallow: /admin/.
[...] /private/ folder, but block all other bots from the entire site. Which robots.txt configuration achieves this?
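Returning to the missing-colon snippet earlier in this section: that kind of mistake can be caught mechanically. The sketch below assumes a simplified view of the syntax, in which every non-blank, non-comment line must look like field: value. The function name find_malformed_lines is hypothetical, and this is not a full robots.txt validator.

```python
def find_malformed_lines(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for lines that lack a 'field: value' colon."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        # Drop any trailing comment and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only lines are fine
        field, sep, _value = line.partition(":")
        # Flag the line if there is no colon or nothing before it.
        if not sep or not field.strip():
            problems.append((number, raw))
    return problems


# The broken snippet from this section: line 2 has no colon after Disallow.
snippet = "User-agent: *\nDisallow /admin/\n"
print(find_malformed_lines(snippet))  # [(2, 'Disallow /admin/')]
```

A check like this matters because parsers commonly skip lines they cannot read rather than report them, so a typo can leave a path unprotected without any visible error.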