All about Robots.txt

A website provides the robots.txt file (part of the robots exclusion protocol) to give web crawlers instructions about its structure. The file contains a specific set of instructions that tells web robots, wanderers, crawlers, and spiders which areas of your website should not be crawled, processed, or indexed. It must be placed in the root directory of the site, e.g. https://www.example.com/robots.txt.

Understand how robots.txt works

Well-behaved web crawlers and robots check the robots.txt file before accessing the rest of your website. Based on the instructions written in the file, they crawl and index only the permitted sections of your site. The file is built from three core directives, shown together in the example after this list.

  1. User-agent: Selects which web crawler/robot the rules that follow apply to.
  2. Allow: Tells the robot it may crawl and index the contents of the specified URL path.
  3. Disallow: Blocks the robot from crawling and indexing the contents of the specified URL path.
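
For instance, here is a minimal robots.txt that combines all three directives (the /private/ directory and welcome.html page are illustrative names):

User-agent: *
Disallow: /private/
Allow: /private/welcome.html

Every compliant crawler will skip the /private/ directory except for the single page /private/welcome.html.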

Important considerations to keep in mind

  • Malware bots and other crawlers/robots that scan your website for security vulnerabilities will never follow robots.txt instructions; the file is a convention, not an enforcement mechanism.
  • The file is publicly accessible, so anyone can read it and see which areas you want kept out of search engines.
  • Wrong directives may block your entire website from search engines, so use the file with caution.

Note: We can write comments in the robots.txt file by prefixing them with #.
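
For example (the comment text and path are illustrative):

# Keep crawlers out of the admin area
User-agent: *
Disallow: /admin/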

User-agent List

Crawler           User agent
Google            Googlebot
Google News       Googlebot-News
Google Images     Googlebot-Image
Google Mobile     Googlebot-Mobile
Google AdSense    Mediapartners-Google
Google AdsBot     AdsBot-Google
Yahoo             Slurp / YahooSeeker
Bing              Bingbot
MSN (Microsoft)   msnbot
Baidu             Baiduspider
Naver (Korean)    Yeti
Yandex            YandexBot
Yandex Images     YandexImages
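
As a usage sketch, any token from the table goes in the User-agent line to target that crawler alone (the /photos/ directory here is illustrative):

User-agent: Googlebot-Image
Disallow: /photos/

This keeps Google Images out of /photos/ while leaving every other crawler unaffected.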

Learn Robots.txt commands with examples


1.    Allow indexing of everything

User-agent: *

Disallow:

or

User-agent: *

Allow: /

2.    Disallow indexing of everything

User-agent: *

Disallow: /

3.    Disallow indexing of a specific directory

User-agent: *

Disallow: /any-directory/

4.    Disallow indexing of a specific page

User-agent: *

Disallow: /about-us.html

5.    Disallow indexing of a specific file

User-agent: *

Disallow: /image.png

6.    Disallow a specific crawler from indexing a specific directory

User-agent: Googlebot

Disallow: /any-directory/

7.    Disallow indexing of a directory, except for one file in that directory

User-agent: *

Disallow: /any-directory/

Allow: /any-directory/myfile.html

8.    Disallow indexing of a specific directory for multiple crawlers:

User-agent: Googlebot

User-agent: Slurp

User-agent: Bingbot

Disallow: /any-directory/

9.    Crawl-delay (used to reduce the crawling rate):

User-agent: Bingbot

Crawl-delay: 10

Note: The value is the delay in seconds between successive requests. Googlebot does not support the Crawl-delay directive.

10.    Robots.txt wildcard URL:

User-agent: *

Disallow: /*.png

Note: For crawlers that support wildcards (such as Googlebot and Bingbot), this pattern blocks every URL whose path contains .png.
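
To block only URLs that actually end in .png, crawlers that support wildcards also recognize the $ end-of-URL anchor:

User-agent: *
Disallow: /*.png$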
