Robots.txt configuration in SEO Fundamentals - Cheat Sheet & Quick Revision

Recall & Review
beginner
What is the purpose of a robots.txt file?
A robots.txt file tells web robots (like search engine crawlers) which parts of a website they can or cannot visit. It controls what is crawled; on its own it does not guarantee that a URL stays out of the index.
beginner
What does the User-agent directive specify in a robots.txt file?
The User-agent directive specifies which web robot the following rules apply to. For example, User-agent: Googlebot targets Google's crawler.
beginner
How do you block all web crawlers from accessing your entire website using robots.txt?
You write:
User-agent: *
Disallow: /
This tells all robots not to visit any pages on the site.
beginner
What does the Disallow directive do in a robots.txt file?
The Disallow directive tells the specified user-agent which paths or pages it should NOT crawl.
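A minimal sketch of how a Disallow path behaves, assuming Python's standard-library urllib.robotparser as a stand-in for a real crawler (search engines may resolve edge cases differently). The crawler name ExampleBot, the example.com URLs, and the /private/ rule are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep every crawler out of /private/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may crawl `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Disallow matches by path prefix, so only URLs under /private/ are blocked.
inside = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private/report.html")
sibling = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/private-notes.html")
outside = is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/public/page.html")

print(inside, sibling, outside)  # False True True
```

The trailing slash matters here: /private-notes.html does not start with /private/, so it stays crawlable.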
intermediate
Can robots.txt prevent a page from being indexed if other sites link to it?
No. robots.txt only controls crawling. If other sites link to a blocked page, search engines might still index its URL without ever crawling its content.
What does User-agent: * mean in a robots.txt file?
A. It applies rules only to Googlebot
B. It allows all users to access the website
C. It blocks all users from the website
D. It applies rules to all web crawlers
How do you allow all web crawlers to access your entire website?
A. User-agent: *
   Disallow: /
B. User-agent: *
   Disallow:
C. User-agent: Googlebot
   Disallow: /
D. User-agent: *
   Allow: /private
Which directive blocks a specific folder from being crawled?
A. Disallow: /folder/
B. Allow: /folder/
C. User-agent: /folder/
D. Block: /folder/
If a page is blocked by robots.txt, can it still appear in search results?
A. Only if the page is on the homepage
B. No, it will never appear
C. Yes, if other sites link to it
D. Only if the page has a sitemap
Where should the robots.txt file be placed on a website?
A. In the root directory of the website
B. In the images folder
C. In the CSS folder
D. Anywhere on the website
Explain how a robots.txt file controls web crawler access to a website.
Think about how you tell robots where they can and cannot go.
Describe a scenario where blocking a page with robots.txt might not prevent it from appearing in search results.
Consider what happens if other sites link to a blocked page.

      Practice

      1. What is the main purpose of a robots.txt file on a website?
      easy
      A. To tell search engines which pages to crawl or not crawl
      B. To speed up the website loading time
      C. To store user login information
      D. To create a sitemap for the website

      Solution

      1. Step 1: Understand the role of robots.txt

        The robots.txt file is used to give instructions to search engine robots about which parts of the website they can access.
      2. Step 2: Identify the correct purpose

        It does not speed up loading, store user data, or create sitemaps. Its main role is to control crawling.
      3. Final Answer:

        To tell search engines which pages to crawl or not crawl -> Option A
      4. Quick Check:

        robots.txt controls crawling = A [OK]
      Hint: robots.txt controls crawling rules for search engines [OK]
      Common Mistakes:
      • Thinking robots.txt speeds up website
      • Confusing robots.txt with sitemap.xml
      • Assuming robots.txt stores user data
      2. Which of the following is the correct syntax to block all web crawlers from accessing the entire website in robots.txt?
      easy
      A. User-agent: *
         Disallow: /
      B. User-agent: *
         Disallow:
      C. User-agent: all
         Disallow: /
      D. User-agent: *
         Allow: /

      Solution

      1. Step 1: Understand the syntax for blocking all

        To block all crawlers, use User-agent: * to target all, and Disallow: / to block the entire site.
      2. Step 2: Check each option

        Option B (User-agent: * with an empty Disallow:) allows everything, because an empty Disallow blocks nothing. Option C uses 'all', which is not a wildcard; it would only match a crawler literally named 'all'. Option D (Allow: /) explicitly allows every page.
      3. Final Answer:

        User-agent: * Disallow: / -> Option A
      4. Quick Check:

        Block all with Disallow: / = A [OK]
      Hint: Use Disallow: / to block entire site for all agents [OK]
      Common Mistakes:
      • Leaving Disallow empty to block site
      • Using 'all' instead of '*' for user-agent
      • Using Allow instead of Disallow to block
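The four options can be compared side by side. This is a sketch that assumes Python's standard-library urllib.robotparser approximates a well-behaved crawler; ExampleBot and the example.com URL are made-up placeholders, and each option is written with its two directives on separate lines, as a real robots.txt requires.

```python
from urllib.robotparser import RobotFileParser

# The four answer options, one directive per line.
OPTIONS = {
    "A": "User-agent: *\nDisallow: /",
    "B": "User-agent: *\nDisallow:",
    "C": "User-agent: all\nDisallow: /",
    "D": "User-agent: *\nAllow: /",
}


def can_crawl(robots_txt: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch an arbitrary page under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, "https://example.com/any/page.html")


results = {label: can_crawl(text) for label, text in OPTIONS.items()}
for label, allowed in results.items():
    print(label, "allowed" if allowed else "blocked")
# A blocked
# B allowed
# C allowed
# D allowed
```

Only option A blocks the placeholder crawler. With this parser, option C is read as a group for a crawler whose name contains "all", so it does not apply to ExampleBot.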
      3. Given the following robots.txt content, which URL will be blocked from crawling?
      User-agent: Googlebot
      Disallow: /private/
      
      User-agent: *
      Disallow: /temp/
      
      medium
      A. https://example.com/private/info.html by Bingbot
      B. https://example.com/temp/info.html by Googlebot
      C. https://example.com/public/page.html by Googlebot
      D. https://example.com/private/data.html by Googlebot

      Solution

      1. Step 1: Analyze rules for Googlebot

        Googlebot is blocked from /private/ but not from /temp/ because the specific rule for Googlebot disallows /private/ only.
      2. Step 2: Analyze rules for other bots

        All other bots (like Bingbot) are blocked from /temp/ but not /private/.
      3. Final Answer:

        https://example.com/private/data.html by Googlebot -> Option D
      4. Quick Check:

        Googlebot blocked /private/ = D [OK]
      Hint: Specific user-agent rules override general ones [OK]
      Common Mistakes:
      • Assuming all bots blocked from /private/
      • Ignoring user-agent specific rules
      • Confusing /temp/ and /private/ paths
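The reasoning above can be checked mechanically. The sketch below assumes Python's standard-library urllib.robotparser as an approximation of how Googlebot and Bingbot pick their group; actual crawlers are not guaranteed to match it in every edge case.

```python
from urllib.robotparser import RobotFileParser

# The robots.txt content from the question.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /temp/
"""

# Each answer option as a (crawler, URL) pair.
CASES = {
    "A": ("Bingbot", "https://example.com/private/info.html"),
    "B": ("Googlebot", "https://example.com/temp/info.html"),
    "C": ("Googlebot", "https://example.com/public/page.html"),
    "D": ("Googlebot", "https://example.com/private/data.html"),
}

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# True means the crawler may fetch the URL, False means it is blocked.
verdicts = {label: parser.can_fetch(agent, url) for label, (agent, url) in CASES.items()}
print(verdicts)  # {'A': True, 'B': True, 'C': True, 'D': False}
```

Googlebot is matched by its own group and therefore ignores the /temp/ rule in the * group, which is why option B is not blocked.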
      4. Identify the error in this robots.txt snippet:
      User-agent: *
      Disallow /admin/
      
      medium
      A. User-agent should be capitalized
      B. Missing colon after Disallow
      C. Disallow path should be empty to block
      D. User-agent cannot be *

      Solution

      1. Step 1: Check syntax for Disallow directive

        Each directive must have a colon after the keyword. Here, Disallow is missing a colon.
      2. Step 2: Verify other parts

        User-agent can be '*', directive names are not case-sensitive, and the Disallow path /admin/ is the correct way to block that folder.
      3. Final Answer:

        Missing colon after Disallow -> Option B
      4. Quick Check:

        Directives need colon after keyword = B [OK]
      Hint: Check for colon after directives like Disallow [OK]
      Common Mistakes:
      • Omitting colon after Disallow
      • Thinking * is invalid user-agent
      • Believing capitalization matters
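To see why the missing colon matters in practice, here is a small sketch assuming Python's standard-library urllib.robotparser. That parser silently skips a line it cannot split on a colon; other parsers may react differently, but none should treat the broken line as a valid rule. ExampleBot and the /admin/login URL are placeholders.

```python
from urllib.robotparser import RobotFileParser


def admin_allowed(robots_txt: str) -> bool:
    """Return True if a placeholder crawler may fetch a page under /admin/."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", "https://example.com/admin/login")


broken = "User-agent: *\nDisallow /admin/"      # colon missing, as in the question
fixed = "User-agent: *\nDisallow: /admin/"      # corrected directive
lowercase = "user-agent: *\ndisallow: /admin/"  # directive names in lowercase

broken_result = admin_allowed(broken)
fixed_result = admin_allowed(fixed)
lowercase_result = admin_allowed(lowercase)

print(broken_result)     # True: the malformed line is ignored, so nothing is blocked
print(fixed_result)      # False
print(lowercase_result)  # False: directive names are matched case-insensitively
```

The broken file produces no error and no block, which is what makes this mistake easy to miss.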
      5. You want to allow Googlebot to crawl everything except the /private/ folder, but block all other bots from the entire site. Which robots.txt configuration achieves this?
      hard
      A. User-agent: Googlebot
         Allow: /
         User-agent: *
         Disallow: /private/
      B. User-agent: *
         Disallow: /
         User-agent: Googlebot
         Allow: /private/
      C. User-agent: Googlebot
         Disallow: /private/
         User-agent: *
         Disallow: /
      D. User-agent: *
         Disallow: /private/
         User-agent: Googlebot
         Disallow: /

      Solution

      1. Step 1: Understand Googlebot's rule

        Googlebot should be allowed everywhere except /private/, so Disallow: /private/ applies to Googlebot.
      2. Step 2: Understand other bots' rule

        All other bots (*) should be blocked from the entire site, so Disallow: / applies to them.
      3. Step 3: Check options

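        Option C pairs Disallow: /private/ for Googlebot with Disallow: / for all other bots, which matches both requirements. Option A lets Googlebot into /private/ and blocks other bots only from /private/, option B places no restriction on Googlebot at all, and option D blocks Googlebot from the entire site.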
      4. Final Answer:

        User-agent: Googlebot Disallow: /private/ User-agent: * Disallow: / -> Option C
      5. Quick Check:

        Googlebot partial block, others full block = C [OK]
      Hint: Give Googlebot its own group; a crawler follows the most specific group that names it, not the * group [OK]
      Common Mistakes:
      • Reversing Allow and Disallow for Googlebot
      • Blocking Googlebot fully by mistake
      • Using Allow incorrectly for blocking
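Option C can be verified with a short sketch, again assuming Python's standard-library urllib.robotparser as a rough model of crawler behavior. The REORDERED variant is an addition for illustration, not one of the answer options: it holds the same two groups in the opposite order to check whether order changes the outcome with this parser.

```python
from urllib.robotparser import RobotFileParser

# Option C, written with one directive per line.
OPTION_C = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""

# Same groups, opposite order (illustrative only).
REORDERED = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /private/
"""

# (crawler, URL) pairs covering both requirements from the question.
CHECKS = [
    ("Googlebot", "https://example.com/index.html"),
    ("Googlebot", "https://example.com/private/data.html"),
    ("Bingbot", "https://example.com/index.html"),
]


def evaluate(robots_txt: str) -> list:
    """Return one allowed/blocked verdict per entry in CHECKS."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [parser.can_fetch(agent, url) for agent, url in CHECKS]


option_c_result = evaluate(OPTION_C)
reordered_result = evaluate(REORDERED)
print(option_c_result)   # [True, False, False]
print(reordered_result)  # [True, False, False]
```

Both orderings give the same verdicts here, because the crawler is matched to the group that names it rather than to whichever group comes first.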