
Robots.txt configuration in SEO Fundamentals - Step-by-Step Execution

Concept Flow - Robots.txt configuration
Start: Browser or Bot Requests URL
  -> Check for robots.txt file
       No  -> Allow full access
       Yes -> Read robots.txt rules
              -> Match User-agent
              -> Check Disallow/Allow rules
              -> Decide if URL is Allowed
                   No  -> Block URL from crawling
                   Yes -> Allow URL to be crawled
When a bot visits a website, it looks for robots.txt to find rules about which pages it can or cannot visit.
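The branching in the flow above can be sketched in Python. This is only an illustration of the decision order, not how any particular crawler is implemented: `decide`, `toy_check`, and their arguments are hypothetical names, and the rule check is passed in as a callable so the sketch shows nothing more than the two branches.

```python
from typing import Callable, Optional

def decide(robots_txt: Optional[str], path: str,
           check_rules: Callable[[str, str], bool]) -> str:
    """Follow the concept flow for one requested path.

    robots_txt is the file's text, or None when the site has no robots.txt.
    check_rules is a hypothetical callable that applies the User-agent,
    Disallow, and Allow rules and returns True when the path may be crawled.
    """
    if robots_txt is None:
        # No robots.txt: the flow goes straight to "Allow full access".
        return "allow"
    # robots.txt exists: read the rules and let them decide.
    return "allow" if check_rules(robots_txt, path) else "block"

# Stand-in rule check (an assumption for this demo): block /private/ only.
def toy_check(robots_txt: str, path: str) -> bool:
    return not path.startswith("/private/")

SAMPLE = "User-agent: *\nDisallow: /private/"

print(decide(None, "/private/data.html", toy_check))    # allow
print(decide(SAMPLE, "/private/data.html", toy_check))  # block
print(decide(SAMPLE, "/public/page.html", toy_check))   # allow
```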
Execution Sample
User-agent: *
Disallow: /private/
Allow: /private/public-info.html
This robots.txt blocks all bots from the /private/ folder, with one exception: the file /private/public-info.html stays crawlable.
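One way to see how these three lines are evaluated is a small Python sketch. It assumes the longest-match precedence described in RFC 9309 and in Google's documentation (the longest matching path wins, and Allow wins a tie), and it treats each rule as a plain path prefix without `*` or `$` wildcards. The names `RULES` and `is_allowed` are made up for this example.

```python
from typing import List, Tuple

# Rules from the sample, as (directive, path-prefix) pairs.
RULES: List[Tuple[str, str]] = [
    ("disallow", "/private/"),
    ("allow", "/private/public-info.html"),
]

def is_allowed(path: str, rules: List[Tuple[str, str]]) -> bool:
    """Return True when the path may be crawled.

    The matching rule with the longest path wins; on a tie, Allow wins.
    A path that matches no rule is allowed by default.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # empty or non-matching rules are ignored
        is_allow = directive == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed

for url_path in ("/private/data.html",
                 "/private/public-info.html",
                 "/public/page.html"):
    print(url_path, "->", "allow" if is_allowed(url_path, RULES) else "block")
```

Under these assumptions the sketch blocks /private/data.html and allows the other two paths, which matches the explanation above.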
Analysis Table
| Step | Action | Input | Rule Matched | Decision |
|------|--------|-------|--------------|----------|
| 1 | Bot requests URL | /private/data.html | N/A | Check robots.txt |
| 2 | Read robots.txt | User-agent: * | Matches all bots | Continue |
| 3 | Check Disallow | /private/ | Matches prefix of URL | Disallow applies |
| 4 | Check Allow | /private/public-info.html | Does not match URL | No Allow override |
| 5 | Final decision | /private/data.html | Disallow | Block crawling |
| 6 | Bot requests URL | /private/public-info.html | N/A | Check robots.txt |
| 7 | Read robots.txt | User-agent: * | Matches all bots | Continue |
| 8 | Check Disallow | /private/ | Matches prefix | Disallow applies |
| 9 | Check Allow | /private/public-info.html | Exact match | Allow overrides Disallow |
| 10 | Final decision | /private/public-info.html | Allow | Allow crawling |
| 11 | Bot requests URL | /public/page.html | N/A | Check robots.txt |
| 12 | Read robots.txt | User-agent: * | Matches all bots | Continue |
| 13 | Check Disallow | /private/ | Does not match URL | No Disallow |
| 14 | Final decision | /public/page.html | No rules matched | Allow crawling |
💡 Each decision follows from the rules that match the URL: a URL is blocked when the most specific matching rule is a Disallow, and allowed otherwise.
State Tracker
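| Variable | Start | After Step 3 | After Step 4 | After Step 5 | After Step 9 | After Step 10 | After Step 14 |
|----------|-------|--------------|--------------|--------------|--------------|---------------|---------------|
| URL | N/A | /private/data.html | /private/data.html | /private/data.html | /private/public-info.html | /private/public-info.html | /public/page.html |
| Disallow Matched | False | True | True | True | True | True | False |
| Allow Matched | False | False | False | False | True | True | False |
| Final Decision | N/A | N/A | N/A | Block | N/A | Allow | Allow |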
Key Insights - 3 Insights
Why does /private/public-info.html get allowed even though /private/ is disallowed?
Because the Allow rule for /private/public-info.html is more specific and overrides the broader Disallow for /private/ as shown in steps 8-10.
What happens if there is no robots.txt file?
Bots assume no restrictions and crawl all pages, as shown in the flow where 'No' robots.txt leads to full access.
Does the order of rules in robots.txt matter?
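For crawlers that follow the current standard (RFC 9309), such as Googlebot, no: the most specific (longest) matching rule takes priority regardless of order, as seen in the Allow overriding the Disallow for a specific file. Some older or simpler parsers apply the first matching rule instead, so order can still matter for them.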
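The order question can be checked with a short Python sketch. It assumes longest-match precedence with Allow winning ties and plain prefix matching (no wildcards); `is_allowed` is a hypothetical helper, not a library function. The sketch evaluates the same two rules in every possible order and collects the decisions.

```python
from itertools import permutations

def is_allowed(path, rules):
    """Longest matching rule wins; Allow wins ties; no match means allowed."""
    matches = [(len(prefix), directive == "allow")
               for directive, prefix in rules
               if prefix and path.startswith(prefix)]
    if not matches:
        return True
    # max() on (length, is_allow) picks the longest rule, preferring Allow on ties.
    return max(matches)[1]

rules = [("disallow", "/private/"), ("allow", "/private/public-info.html")]

# Evaluate every ordering of the rules and collect the distinct decisions.
results = {is_allowed("/private/public-info.html", list(order))
           for order in permutations(rules)}
print(results)  # {True}
```

A parser that stops at the first matching rule would not have this property, which is why the answer depends on the crawler.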
Visual Quiz - 3 Questions
Test your understanding
Look at the Analysis Table. What is the final decision for the URL '/private/data.html' at step 5?
ABlock crawling
BAllow crawling
CNo decision made
DPartial access
💡 Hint
Check the 'Decision' column at step 5 in the Analysis Table.
At which step does the Allow rule override the Disallow rule for '/private/public-info.html'?
AStep 4
BStep 8
CStep 9
DStep 10
💡 Hint
Look at the 'Rule Matched' and 'Decision' columns for steps 8-10 in the Analysis Table.
If the Disallow rule were removed, what would be the final decision for '/private/data.html'?
ABlock crawling
BAllow crawling
CNo robots.txt found
DDepends on User-agent
💡 Hint
Refer to the Analysis Table rows where no Disallow rule matches (steps 13-14): the URL is allowed by default.
Concept Snapshot
Robots.txt tells bots which parts of a website to crawl or avoid.
Use 'User-agent' to specify bots.
'Allow' and 'Disallow' set access rules.
More specific rules override broader ones.
If there is no robots.txt, bots treat the whole site as crawlable.
Full Transcript
When a bot visits a website, it first looks for a robots.txt file. If found, it reads the rules inside. These rules specify which parts of the site the bot can or cannot visit. The bot matches its name to the 'User-agent' rules. Then it checks 'Disallow' and 'Allow' paths to decide if it can crawl a URL. More specific rules override general ones. If no robots.txt exists, bots assume they can crawl all pages. For example, if '/private/' is disallowed but '/private/public-info.html' is allowed, the bot will crawl the allowed file but not the rest of the private folder.

Practice

1. What is the main purpose of a robots.txt file on a website?
easy
A. To tell search engines which pages to crawl or not crawl
B. To speed up the website loading time
C. To store user login information
D. To create a sitemap for the website

Solution

  1. Step 1: Understand the role of robots.txt

    The robots.txt file is used to give instructions to search engine robots about which parts of the website they can access.
  2. Step 2: Identify the correct purpose

    It does not speed up loading, store user data, or create sitemaps. Its main role is to control crawling.
  3. Final Answer:

    To tell search engines which pages to crawl or not crawl -> Option A
  4. Quick Check:

    robots.txt controls crawling = A [OK]
Hint: robots.txt controls crawling rules for search engines [OK]
Common Mistakes:
  • Thinking robots.txt speeds up website
  • Confusing robots.txt with sitemap.xml
  • Assuming robots.txt stores user data
2. Which of the following is the correct syntax to block all web crawlers from accessing the entire website in robots.txt?
easy
A. User-agent: * Disallow: /
B. User-agent: * Disallow:
C. User-agent: all Disallow: /
D. User-agent: * Allow: /

Solution

  1. Step 1: Understand the syntax for blocking all

    To block all crawlers, use User-agent: * to target all, and Disallow: / to block the entire site.
  2. Step 2: Check each option

    Option B (User-agent: * with an empty Disallow:) allows everything, because an empty Disallow blocks nothing. Option C uses 'all', which is not a wildcard; it would only match a bot literally named 'all'. Option D (Allow: /) explicitly allows all pages.
  3. Final Answer:

    User-agent: * Disallow: / -> Option A
  4. Quick Check:

    Block all with Disallow: / = A [OK]
Hint: Use Disallow: / to block entire site for all agents [OK]
Common Mistakes:
  • Leaving Disallow empty to block site
  • Using 'all' instead of '*' for user-agent
  • Using Allow instead of Disallow to block
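The difference between options A and B can be shown with a small Python sketch. The options above are printed on one line, but in a real file each directive goes on its own line. The helper `blocked` is a hypothetical name, and the sketch assumes plain prefix matching in which an empty Disallow value matches nothing.

```python
def blocked(path: str, disallow_values: list) -> bool:
    """True when any non-empty Disallow value is a prefix of the path."""
    return any(bool(value) and path.startswith(value)
               for value in disallow_values)

paths = ["/", "/index.html", "/shop/cart"]

# Option A: "Disallow: /" is a prefix of every path, so everything is blocked.
option_a = [blocked(p, ["/"]) for p in paths]
print(option_a)  # [True, True, True]

# Option B: an empty "Disallow:" matches nothing, so nothing is blocked.
option_b = [blocked(p, [""]) for p in paths]
print(option_b)  # [False, False, False]
```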
3. Given the following robots.txt content, which URL will be blocked from crawling?
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /temp/
medium
A. https://example.com/private/info.html by Bingbot
B. https://example.com/temp/info.html by Googlebot
C. https://example.com/public/page.html by Googlebot
D. https://example.com/private/data.html by Googlebot

Solution

  1. Step 1: Analyze rules for Googlebot

    Googlebot is blocked from /private/ but not from /temp/ because the specific rule for Googlebot disallows /private/ only.
  2. Step 2: Analyze rules for other bots

    All other bots (like Bingbot) are blocked from /temp/ but not /private/.
  3. Final Answer:

    https://example.com/private/data.html by Googlebot -> Option D
  4. Quick Check:

    Googlebot blocked /private/ = D [OK]
Hint: Specific user-agent rules override general ones [OK]
Common Mistakes:
  • Assuming all bots blocked from /private/
  • Ignoring user-agent specific rules
  • Confusing /temp/ and /private/ paths
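The group selection in this question can be traced with a short Python sketch. It assumes a bot uses the group that names it and falls back to `*` only when no such group exists. The `GROUPS` structure and helper names are hypothetical, and the exact, case-insensitive name comparison is a simplification of how real crawlers match their product token.

```python
from typing import Dict, List

# The file from the question, already split into groups (hypothetical structure).
GROUPS: Dict[str, List[str]] = {
    "googlebot": ["/private/"],   # Disallow values for Googlebot
    "*": ["/temp/"],              # Disallow values for every other bot
}

def disallows_for(bot: str, groups: Dict[str, List[str]]) -> List[str]:
    """Pick the group named after the bot, falling back to '*'."""
    return groups.get(bot.lower(), groups.get("*", []))

def is_blocked(bot: str, path: str) -> bool:
    """True when a Disallow value in the bot's group is a prefix of the path."""
    return any(path.startswith(prefix) for prefix in disallows_for(bot, GROUPS))

cases = {
    "A": ("Bingbot", "/private/info.html"),
    "B": ("Googlebot", "/temp/info.html"),
    "C": ("Googlebot", "/public/page.html"),
    "D": ("Googlebot", "/private/data.html"),
}
for option, (bot, path) in cases.items():
    print(option, bot, path, "blocked" if is_blocked(bot, path) else "allowed")
```

Only option D comes out as blocked, because Googlebot reads its own group and ignores the `*` group entirely.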
4. Identify the error in this robots.txt snippet:
User-agent: *
Disallow /admin/
medium
A. User-agent should be capitalized
B. Missing colon after Disallow
C. Disallow path should be empty to block
D. User-agent cannot be *

Solution

  1. Step 1: Check syntax for Disallow directive

    Each directive must have a colon after the keyword. Here, Disallow is missing a colon.
  2. Step 2: Verify other parts

    User-agent can be '*', directive names are not case-sensitive, and the Disallow path /admin/ is the right value to block that folder.
  3. Final Answer:

    Missing colon after Disallow -> Option B
  4. Quick Check:

    Directives need colon after keyword = B [OK]
Hint: Check for colon after directives like Disallow [OK]
Common Mistakes:
  • Omitting colon after Disallow
  • Thinking * is invalid user-agent
  • Believing directive names are case-sensitive (paths are, directive names are not)
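This kind of mistake is easy to catch mechanically. The following Python sketch is a minimal check, not a full validator: `find_missing_colons` is a hypothetical helper that flags any non-blank, non-comment line that has no colon at all.

```python
from typing import List, Tuple

def find_missing_colons(robots_txt: str) -> List[Tuple[int, str]]:
    """Return (line number, text) for non-blank, non-comment lines without a colon."""
    problems = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        text = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if text and ":" not in text:
            problems.append((number, text))
    return problems

snippet = "User-agent: *\nDisallow /admin/\n"
print(find_missing_colons(snippet))  # [(2, 'Disallow /admin/')]
```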
5. You want to allow Googlebot to crawl everything except the /private/ folder, but block all other bots from the entire site. Which robots.txt configuration achieves this?
hard
A. User-agent: Googlebot Allow: / User-agent: * Disallow: /private/
B. User-agent: * Disallow: / User-agent: Googlebot Allow: /private/
C. User-agent: Googlebot Disallow: /private/ User-agent: * Disallow: /
D. User-agent: * Disallow: /private/ User-agent: Googlebot Disallow: /

Solution

  1. Step 1: Understand Googlebot's rule

    Googlebot should be allowed everywhere except /private/, so Disallow: /private/ applies to Googlebot.
  2. Step 2: Understand other bots' rule

    All other bots (*) should be blocked from the entire site, so Disallow: / applies to them.
  3. Step 3: Check options

    User-agent: Googlebot Disallow: /private/ User-agent: * Disallow: / matches these rules exactly. Other options either allow or block incorrectly.
  4. Final Answer:

    User-agent: Googlebot Disallow: /private/ User-agent: * Disallow: / -> Option C
  5. Quick Check:

    Googlebot partial block, others full block = C [OK]
Hint: Give Googlebot its own group and use * for every other bot [OK]
Common Mistakes:
  • Reversing Allow and Disallow for Googlebot
  • Blocking Googlebot fully by mistake
  • Using Allow incorrectly for blocking
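Option C can be checked end to end with a small Python sketch. It is a simplified parser written for this example, not a complete implementation: `parse` and `can_crawl` are hypothetical names, user-agent names are compared exactly (case-insensitively), paths are plain prefixes without wildcards, and the longest matching rule wins with Allow winning ties.

```python
from typing import Dict, List, Tuple

Rule = Tuple[str, str]  # (directive, path)

def parse(robots_txt: str) -> Dict[str, List[Rule]]:
    """Split robots.txt text into {user-agent: [(directive, path), ...]}."""
    groups: Dict[str, List[Rule]] = {}
    agents: List[str] = []
    in_rules = False
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            in_rules = True
            for agent in agents:
                groups[agent].append((field, value))
    return groups

def can_crawl(groups: Dict[str, List[Rule]], bot: str, path: str) -> bool:
    """Use the bot's own group if present, else '*'; longest match wins."""
    rules = groups.get(bot.lower(), groups.get("*", []))
    matches = [(len(p), d == "allow") for d, p in rules if p and path.startswith(p)]
    return max(matches)[1] if matches else True

OPTION_C = """\
User-agent: Googlebot
Disallow: /private/

User-agent: *
Disallow: /
"""
groups = parse(OPTION_C)
print(can_crawl(groups, "Googlebot", "/blog/post.html"))  # True
print(can_crawl(groups, "Googlebot", "/private/a.html"))  # False
print(can_crawl(groups, "Bingbot", "/blog/post.html"))    # False
```

The example paths /blog/post.html and /private/a.html are invented for the demonstration; any path outside or inside /private/ would behave the same way.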