All about Robots.txt
A website serves a robots.txt file (the Robots Exclusion Protocol) to give web crawlers instructions about its structure. The file contains a set of directives that tell web robots (also called wanderers, crawlers, or spiders) which areas of the website should not be crawled, processed, or indexed.
How robots.txt works
Web crawlers and robots check the robots.txt file before accessing your website. Following the instructions written in the file, they crawl and index only the permitted sections of your site. The file is built from three core directives.
- User-agent: selects which web crawler/robot the rules that follow apply to.
- Allow: tells the robot it may crawl and index the contents of a particular URL path.
- Disallow: blocks the robot from crawling and indexing the contents of a particular URL path.
Important considerations to keep in mind
- Malware and other crawlers/robots that scan your website for security vulnerabilities will simply ignore robots.txt instructions.
- The file is publicly visible, so anyone can see and read it.
- A wrong directive can block your entire website from search engines, so use the file with caution.
Note: Comments can be written in the robots.txt file; everything after a # on a line is ignored by crawlers.
Some well-known crawler User-agent names:

| Search Engine | Crawler (User-agent) |
| --- | --- |
| Google | Googlebot |
| Bing | Bingbot |
| Yahoo | Slurp / YahooSeeker |
Learn Robots.txt commands with examples
1. Allow indexing of everything
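A minimal robots.txt that allows everything leaves the Disallow value empty (an empty value matches no URLs):

```
User-agent: *
Disallow:
```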
2. Disallow indexing of everything
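To block all compliant crawlers from the entire site, disallow the root path:

```
User-agent: *
Disallow: /
```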
3. Disallow indexing of a specific directory
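Assuming a hypothetical /private/ directory as the example, the trailing slash limits the rule to that directory and everything under it:

```
User-agent: *
Disallow: /private/
```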
4. Disallow indexing of a specific page
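With a hypothetical page path /private/page.html standing in for the page you want to block:

```
User-agent: *
Disallow: /private/page.html
```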
5. Disallow indexing of a specific file
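Using a placeholder file path /files/document.pdf for illustration:

```
User-agent: *
Disallow: /files/document.pdf
```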
6. Disallow specific crawlers from indexing of a specific directory
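Here the record names one crawler by its User-agent (Googlebot in this sketch) instead of the * wildcard, so only that crawler is blocked; the directory name is a placeholder:

```
User-agent: Googlebot
Disallow: /private/
```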
7. Disallow indexing of a directory, except for the one file in that directory
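Major crawlers apply the most specific matching rule, so an Allow for one file can carve an exception out of a broader Disallow (both paths here are placeholders):

```
User-agent: *
Allow: /private/public-page.html
Disallow: /private/
```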
8. Disallow indexing of a specific directory for multiple crawlers
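Each crawler gets its own record, and records are separated by a blank line (the crawler names and directory below are illustrative):

```
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /private/
```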
9. Crawl-delay (used to reduce the crawling rate):
User-agent: Bingbot
Crawl-delay: 10    # value is in seconds; Googlebot ignores Crawl-delay (its rate is set in Search Console)
10. Robots.txt wildcard URLs:
Disallow: /*.png$    # blocks all URLs that end with .png (the $ anchors the match to the end of the URL; without it, .png would match anywhere in the URL)