Robots.txt configuration in SEO Fundamentals - Full Explanation

Introduction
Websites want to control which parts search engines crawl and which parts they skip. Without clear instructions, search engines might crawl pages the owner would rather keep out of search, or waste time on unimportant pages.
Explanation
Purpose of robots.txt
Robots.txt is a simple text file placed on a website that tells search engine crawlers which pages or sections they should not visit. It helps manage how crawlers spend their time on the site and keeps them away from irrelevant content. It is not a security mechanism: the file is publicly readable and following it is voluntary, so sensitive content needs real access controls.
Robots.txt guides search engines on what parts of a website to avoid crawling.
User-agent directive
This part specifies which search engine or robot the rules apply to. For example, 'User-agent: *' applies the rules to every crawler that has no more specific group of its own, while naming a specific robot targets only that one. This allows websites to customize instructions for different crawlers.
User-agent defines which search engines the rules affect.
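As a minimal sketch of how groups are selected, the snippet below feeds a small rule set to Python's standard-library urllib.robotparser. The paths /drafts/ and /archive/, the bot names, and the helper is_allowed are assumptions made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for Googlebot, one for every other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow: /archive/
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the rules above let user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


# Googlebot follows only its own group, so /archive/ stays open to it.
print(is_allowed("Googlebot", "https://example.com/drafts/post.html"))   # False
print(is_allowed("Googlebot", "https://example.com/archive/2020.html"))  # True
# Any other crawler falls back to the * group.
print(is_allowed("Bingbot", "https://example.com/archive/2020.html"))    # False
print(is_allowed("Bingbot", "https://example.com/drafts/post.html"))     # True
```

A crawler with its own group ignores the * group entirely, which is why Googlebot can still reach /archive/ here.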
Disallow directive
Disallow tells the robot which pages or folders it should not visit. If a path is listed here, the robot will avoid crawling those pages. Leaving Disallow empty means the robot can crawl everything, while 'Disallow: /' blocks the entire site.
Disallow lists the parts of the website robots should not crawl.
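The three cases described above (empty Disallow, one folder, the whole site) can be checked with a short sketch using Python's standard-library urllib.robotparser. The helper can_crawl, the bot name ExampleBot, and the example paths are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, url: str, user_agent: str = "ExampleBot") -> bool:
    """Parse robots_txt and report whether user_agent may crawl url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


EMPTY_DISALLOW = "User-agent: *\nDisallow:"            # nothing is blocked
FOLDER_DISALLOW = "User-agent: *\nDisallow: /private/"  # one folder is blocked
SITE_DISALLOW = "User-agent: *\nDisallow: /"            # everything is blocked

print(can_crawl(EMPTY_DISALLOW, "https://example.com/private/a.html"))   # True
print(can_crawl(FOLDER_DISALLOW, "https://example.com/private/a.html"))  # False
print(can_crawl(FOLDER_DISALLOW, "https://example.com/blog/a.html"))     # True
print(can_crawl(SITE_DISALLOW, "https://example.com/blog/a.html"))       # False
```

Disallow values are matched as path prefixes, so /private/ covers every URL whose path starts with /private/.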
Allow directive
Allow is used to override a Disallow rule for specific pages or folders. For example, if a whole folder is disallowed but one page inside it should be accessible, Allow specifies that exception. This helps fine-tune what robots can see.
Allow lets specific pages be crawled even if their folder is disallowed.
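A hedged sketch of an Allow exception follows, again using Python's urllib.robotparser. One assumption matters here: the current standard (RFC 9309) says the most specific, longest matching rule wins, while Python's parser has historically applied the first matching rule in a group. Listing the Allow line before the Disallow line gives the same result under both interpretations. The paths and the helper can_crawl are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Allow is listed first so that first-match parsers and
# longest-match parsers reach the same answer.
ROBOTS_TXT = """\
User-agent: *
Allow: /private/public.html
Disallow: /private/
"""


def can_crawl(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the rules above let user_agent crawl url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_crawl("https://example.com/private/public.html"))  # True (the exception)
print(can_crawl("https://example.com/private/secret.html"))  # False (folder rule)
print(can_crawl("https://example.com/about.html"))           # True (no rule matches)
```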
Location and format
The robots.txt file must be placed in the website’s root folder (like example.com/robots.txt) and must be plain text. It follows a simple line-by-line format with directives and values. Search engines look for this file automatically before crawling.
Robots.txt must be in the website root and follow a simple text format.
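Because the file always sits at the root of the host, its address can be derived from any page URL. The sketch below shows one way to do that with urllib.parse; the function name robots_txt_url and the sample URLs are assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://example.com/blog/post?id=1"))
# https://example.com/robots.txt
print(robots_txt_url("https://shop.example.com/cart"))
# https://shop.example.com/robots.txt
```

Each host, including each subdomain, is treated as having its own robots.txt, which is why the second call returns a different file.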
Real World Analogy

Imagine a library where some bookshelves are open to all visitors, but others are behind locked doors. The librarian puts up signs telling visitors which shelves they can browse and which are off-limits. Sometimes, a special book inside a locked shelf is allowed for viewing with permission.

Purpose of robots.txt → Library signs showing which shelves visitors can access
User-agent directive → Signs addressed to specific visitor groups or all visitors
Disallow directive → Signs marking shelves that visitors cannot enter
Allow directive → Special permission to view a book inside a restricted shelf
Location and format → The librarian placing signs clearly at the library entrance
Diagram
┌─────────────────────────────┐
│        Website Root          │
│  (example.com/robots.txt)    │
├─────────────────────────────┤
│ User-agent: *               │
│ Disallow: /private/         │
│ Allow: /private/public.html │
└─────────────────────────────┘
          ↓
┌─────────────────────────────┐
│ Search Engine Robots Read    │
│ robots.txt and Follow Rules │
└─────────────────────────────┘
This diagram shows the robots.txt file at the website root giving crawl instructions to search engine robots.
Key Facts
robots.txt: A text file that tells search engines which parts of a website to avoid crawling.
User-agent: Specifies which search engine or robot the rules apply to.
Disallow: Lists pages or folders that robots should not visit.
Allow: Specifies exceptions where robots can crawl despite a disallow rule.
File location: robots.txt must be placed in the website's root directory.
Common Confusions
robots.txt blocks pages from appearing in search results.
robots.txt only stops robots from crawling pages; it does not guarantee those pages won't appear in search results if they are linked from elsewhere.
Disallow means the page is deleted or inaccessible to users.
Disallow only restricts robots, not human visitors; users can still access those pages normally.
Summary
Robots.txt helps websites control which parts search engines can crawl.
It uses User-agent, Disallow, and Allow directives to give clear instructions to different robots.
The file must be placed in the website root and follow a simple text format to work correctly.

Practice

(1/5)
1. What is the main purpose of a robots.txt file on a website?
easy
A. To tell search engines which pages to crawl or not crawl
B. To speed up the website loading time
C. To store user login information
D. To create a sitemap for the website

Solution

  1. Step 1: Understand the role of robots.txt

    The robots.txt file is used to give instructions to search engine robots about which parts of the website they can access.
  2. Step 2: Identify the correct purpose

    It does not speed up loading, store user data, or create sitemaps. Its main role is to control crawling.
  3. Final Answer:

    To tell search engines which pages to crawl or not crawl -> Option A
  4. Quick Check:

    robots.txt controls crawling = A [OK]
Hint: robots.txt controls crawling rules for search engines [OK]
Common Mistakes:
  • Thinking robots.txt speeds up website
  • Confusing robots.txt with sitemap.xml
  • Assuming robots.txt stores user data
2. Which of the following is the correct syntax to block all web crawlers from accessing the entire website in robots.txt?
easy
A. User-agent: * Disallow: /
B. User-agent: * Disallow:
C. User-agent: all Disallow: /
D. User-agent: * Allow: /

Solution

  1. Step 1: Understand the syntax for blocking all

    To block all crawlers, use User-agent: * to target all, and Disallow: / to block the entire site.
  2. Step 2: Check each option

    User-agent: * Disallow: allows all because Disallow is empty. User-agent: all Disallow: / uses 'all' which is invalid. User-agent: * Allow: / allows all pages.
  3. Final Answer:

    User-agent: * Disallow: / -> Option A
  4. Quick Check:

    Block all with Disallow: / = A [OK]
Hint: Use Disallow: / to block entire site for all agents [OK]
Common Mistakes:
  • Leaving Disallow empty to block site
  • Using 'all' instead of '*' for user-agent
  • Using Allow instead of Disallow to block
3. Given the following robots.txt content, which URL will be blocked from crawling?
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /temp/
medium
A. https://example.com/private/info.html by Bingbot
B. https://example.com/temp/info.html by Googlebot
C. https://example.com/public/page.html by Googlebot
D. https://example.com/private/data.html by Googlebot

Solution

  1. Step 1: Analyze rules for Googlebot

    Googlebot is blocked from /private/ but not from /temp/ because the specific rule for Googlebot disallows /private/ only.
  2. Step 2: Analyze rules for other bots

    All other bots (like Bingbot) are blocked from /temp/ but not /private/.
  3. Final Answer:

    https://example.com/private/data.html by Googlebot -> Option D
  4. Quick Check:

    Googlebot blocked /private/ = D [OK]
Hint: Specific user-agent rules override general ones [OK]
Common Mistakes:
  • Assuming all bots blocked from /private/
  • Ignoring user-agent specific rules
  • Confusing /temp/ and /private/ paths
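The answer to question 3 can be double-checked with a short sketch that runs each option through Python's urllib.robotparser. The OPTIONS table and the helper blocked_options are assumptions introduced for illustration; the rules and URLs are copied from the question.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /temp/
"""

# Each option pairs a crawler name with the URL it tries to fetch.
OPTIONS = {
    "A": ("Bingbot", "https://example.com/private/info.html"),
    "B": ("Googlebot", "https://example.com/temp/info.html"),
    "C": ("Googlebot", "https://example.com/public/page.html"),
    "D": ("Googlebot", "https://example.com/private/data.html"),
}


def blocked_options() -> list[str]:
    """Return the labels of the options that the rules block."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return [
        label
        for label, (agent, url) in OPTIONS.items()
        if not parser.can_fetch(agent, url)
    ]


print(blocked_options())  # ['D']
```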
4. Identify the error in this robots.txt snippet:
User-agent: *
Disallow /admin/
medium
A. User-agent should be capitalized
B. Missing colon after Disallow
C. Disallow path should be empty to block
D. User-agent cannot be *

Solution

  1. Step 1: Check syntax for Disallow directive

    Each directive must have a colon after the keyword. Here, Disallow is missing a colon.
  2. Step 2: Verify other parts

    User-agent can be '*', capitalization is not strict, and Disallow path is correct to block /admin/.
  3. Final Answer:

    Missing colon after Disallow -> Option B
  4. Quick Check:

    Directives need colon after keyword = B [OK]
Hint: Check for colon after directives like Disallow [OK]
Common Mistakes:
  • Omitting colon after Disallow
  • Thinking * is invalid user-agent
  • Believing capitalization matters
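A missing colon is easy to overlook by eye. The sketch below is a small hypothetical check (the function name lines_missing_colon is an assumption, not part of any standard tool) that flags non-empty, non-comment lines without a colon.

```python
def lines_missing_colon(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line number, text) for each directive line that lacks a colon."""
    problems = []
    for number, raw in enumerate(robots_txt.splitlines(), start=1):
        # Drop trailing comments and surrounding whitespace first.
        line = raw.split("#", 1)[0].strip()
        if line and ":" not in line:
            problems.append((number, line))
    return problems


snippet = "User-agent: *\nDisallow /admin/"
print(lines_missing_colon(snippet))  # [(2, 'Disallow /admin/')]
```

This only checks for the colon; it does not validate directive names or paths.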
5. You want to allow Googlebot to crawl everything except the /private/ folder, but block all other bots from the entire site. Which robots.txt configuration achieves this?
hard
A. User-agent: Googlebot Allow: / User-agent: * Disallow: /private/
B. User-agent: * Disallow: / User-agent: Googlebot Allow: /private/
C. User-agent: Googlebot Disallow: /private/ User-agent: * Disallow: /
D. User-agent: * Disallow: /private/ User-agent: Googlebot Disallow: /

Solution

  1. Step 1: Understand Googlebot's rule

    Googlebot should be allowed everywhere except /private/, so Disallow: /private/ applies to Googlebot.
  2. Step 2: Understand other bots' rule

    All other bots (*) should be blocked from the entire site, so Disallow: / applies to them.
  3. Step 3: Check options

    User-agent: Googlebot Disallow: /private/ User-agent: * Disallow: / matches these rules exactly. Other options either allow or block incorrectly.
  4. Final Answer:

    User-agent: Googlebot Disallow: /private/ User-agent: * Disallow: / -> Option C
  5. Quick Check:

    Googlebot partial block, others full block = C [OK]
Hint: Give Googlebot its own group; the * group covers every other bot [OK]
Common Mistakes:
  • Reversing Allow and Disallow for Googlebot
  • Blocking Googlebot fully by mistake
  • Using Allow incorrectly for blocking
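As a sketch, the four options in question 5 can be tested against the stated goal with Python's urllib.robotparser. The helper meets_goal, the test URLs, and the use of Bingbot as a stand-in for "all other bots" are assumptions for illustration; the option text is written one directive per line.

```python
from urllib.robotparser import RobotFileParser

OPTIONS = {
    "A": "User-agent: Googlebot\nAllow: /\n\nUser-agent: *\nDisallow: /private/",
    "B": "User-agent: *\nDisallow: /\n\nUser-agent: Googlebot\nAllow: /private/",
    "C": "User-agent: Googlebot\nDisallow: /private/\n\nUser-agent: *\nDisallow: /",
    "D": "User-agent: *\nDisallow: /private/\n\nUser-agent: Googlebot\nDisallow: /",
}

PUBLIC = "https://example.com/index.html"
PRIVATE = "https://example.com/private/data.html"


def meets_goal(robots_txt: str) -> bool:
    """Googlebot: everything except /private/. Other bots: nothing."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return (
        parser.can_fetch("Googlebot", PUBLIC)
        and not parser.can_fetch("Googlebot", PRIVATE)
        and not parser.can_fetch("Bingbot", PUBLIC)
        and not parser.can_fetch("Bingbot", PRIVATE)
    )


results = {label: meets_goal(text) for label, text in OPTIONS.items()}
print(results)  # {'A': False, 'B': False, 'C': True, 'D': False}
```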