
Robots.txt configuration in SEO Fundamentals - Full Explanation

Introduction
Websites want to control which parts search engines can see and which parts stay private. Without clear instructions, search engines might index pages that should remain hidden or waste time on unimportant pages.
Explanation
Purpose of robots.txt
Robots.txt is a simple text file placed on a website that tells search engine crawlers which pages or sections they should not visit. It helps manage the website’s visibility on search engines and keeps sensitive or irrelevant content out of the crawl.
Robots.txt guides search engines on what parts of a website to avoid crawling.
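A minimal robots.txt looks like this (the folder names are illustrative examples, not required paths):

```
# Applies to all crawlers: keep the admin and checkout areas out of the crawl.
User-agent: *
Disallow: /admin/
Disallow: /checkout/
```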
User-agent directive
This part specifies which search engine or robot the rules apply to. For example, 'User-agent: *' means the rules apply to all search engines, while naming a specific robot targets only that one. This allows websites to customize instructions for different crawlers.
User-agent defines which search engines the rules affect.
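A file can contain several groups, each starting with its own User-agent line. In this hedged example, one hypothetical crawler name (Googlebot) gets its own rules while everyone else falls under the wildcard group:

```
# Rules that apply only to Googlebot.
User-agent: Googlebot
Disallow: /experiments/

# Default rules for every other crawler.
User-agent: *
Disallow: /drafts/
```

A crawler reads only the group that matches it most specifically, so Googlebot here would ignore the wildcard group entirely.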
Disallow directive
Disallow tells the robot which pages or folders it should not visit. If a path is listed here, the robot will avoid crawling those pages. Leaving Disallow empty means the robot can crawl everything, while 'Disallow: /' blocks the entire site.
Disallow lists the parts of the website robots should not crawl.
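Disallow values are path prefixes, so one rule can cover a whole folder while another targets a single file (again, example paths):

```
# Block an entire folder and one individual page.
User-agent: *
Disallow: /private/
Disallow: /tmp-report.html
```

An empty value (`Disallow:`) blocks nothing, and a bare slash (`Disallow: /`) blocks the whole site.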
Allow directive
Allow is used to override a Disallow rule for specific pages or folders. For example, if a whole folder is disallowed but one page inside it should be accessible, Allow specifies that exception. This helps fine-tune what robots can see.
Allow lets specific pages be crawled even if their folder is disallowed.
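The exception described above can be sketched like this (example paths). Major search engines generally pick the most specific matching rule rather than the first one, but listing Allow before Disallow is the safer habit for simpler parsers that read rules in order:

```
# The folder is blocked, but one page inside it is carved out.
User-agent: *
Allow: /private/public.html
Disallow: /private/
```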
Location and format
The robots.txt file must be placed in the website’s root folder (like example.com/robots.txt) and must be plain text. It follows a simple line-by-line format with directives and values. Search engines look for this file automatically before crawling.
Robots.txt must be in the website root and follow a simple text format.
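You can check how a set of rules is interpreted with Python's standard-library urllib.robotparser; the domain and paths below are illustrative. Note that this parser applies rules in file order, which is why the Allow line comes first:

```python
from urllib.robotparser import RobotFileParser

# Feed the parser the rules directly instead of fetching a live file.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /private/public.html",  # exception listed before the broader block
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/secret.html"))  # False: inside the disallowed folder
print(rp.can_fetch("*", "https://example.com/private/public.html"))  # True: the Allow exception
print(rp.can_fetch("*", "https://example.com/index.html"))           # True: no rule matches
```

To test a real site's file, you would instead call `rp.set_url("https://example.com/robots.txt")` followed by `rp.read()`, which downloads and parses the live file.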
Real World Analogy

Imagine a library where some bookshelves are open to all visitors, but others are behind locked doors. The librarian puts up signs telling visitors which shelves they can browse and which are off-limits. Sometimes, a special book inside a locked shelf is allowed for viewing with permission.

Purpose of robots.txt → Library signs showing which shelves visitors can access
User-agent directive → Signs addressed to specific visitor groups or all visitors
Disallow directive → Signs marking shelves that visitors cannot enter
Allow directive → Special permission to view a book inside a restricted shelf
Location and format → The librarian placing signs clearly at the library entrance
Diagram
┌─────────────────────────────┐
│        Website Root         │
│  (example.com/robots.txt)   │
├─────────────────────────────┤
│ User-agent: *               │
│ Disallow: /private/         │
│ Allow: /private/public.html │
└─────────────────────────────┘
              ↓
┌─────────────────────────────┐
│ Search Engine Robots Read   │
│ robots.txt and Follow Rules │
└─────────────────────────────┘
This diagram shows the robots.txt file at the website root giving crawl instructions to search engine robots.
Key Facts
robots.txt: A text file that tells search engines which parts of a website to avoid crawling.
User-agent: Specifies which search engine or robot the rules apply to.
Disallow: Lists pages or folders that robots should not visit.
Allow: Specifies exceptions where robots can crawl despite a disallow rule.
File location: robots.txt must be placed in the website's root directory.
Common Confusions
robots.txt blocks pages from appearing in search results.
In reality, robots.txt only stops robots from crawling pages; it does not guarantee those pages won’t appear in search results if they are linked from elsewhere.
Disallow means the page is deleted or inaccessible to users.
In reality, Disallow only restricts robots, not human visitors; users can still access those pages normally.
Summary
Robots.txt helps websites control which parts search engines can crawl and index.
It uses User-agent, Disallow, and Allow directives to give clear instructions to different robots.
The file must be placed in the website root and follow a simple text format to work correctly.