Link Trick: Robots.txt
By: Mandeep
“Robots.txt” is a regular text file that resides in the root directory of a website. By defining a few rules in this file, you can instruct robots not to crawl, and therefore typically not to index, certain files or directories within your site, or the site as a whole. The file stands alone: no standard webpage on the site shows its contents, so you have to request it directly. Reading a site's robots.txt file shows you that site from a search engine robot's point of view.
Creating your “robots.txt” file:
Make sure the file is named exactly “robots.txt” and is uploaded to the root directory of your site, where crawlers can access it.
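Because the file always sits at the root, its address can be worked out from any page on the site. A minimal Python sketch, assuming a hypothetical site at example.com (the helper name robots_url is made up for illustration):

```python
from urllib.parse import urljoin

def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the site serving page_url."""
    # The leading slash makes urljoin drop the page's own path
    # and keep only the scheme and host.
    return urljoin(page_url, "/robots.txt")

# example.com is a placeholder host, not a site from this article.
print(robots_url("https://example.com/tutorials/blank.htm"))
# prints https://example.com/robots.txt
```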
1) Here’s a basic “robots.txt”:
User-agent: *
Disallow: /
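A User-agent line identifies the crawler in question, followed by one or more Disallow lines that stop it from crawling certain parts of your site. Here “*” matches every crawler and “/” matches every path, so this file blocks all crawlers from the entire site.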
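One way to check what a rule set does before uploading it is Python's standard urllib.robotparser module. The sketch below feeds it the basic file above; the bot name ExampleBot, the example.com URL, and the helper is_allowed are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /
"""

def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())  # parse from text, no network access
    return parser.can_fetch(agent, url)

# "ExampleBot" and example.com are hypothetical.
print(is_allowed(RULES, "ExampleBot", "https://example.com/index.html"))
# prints False: every path is disallowed for every crawler
```

Real crawlers apply their own parsers, so treat this as a quick sanity check rather than a guarantee of how a given search engine will behave.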
2)
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm
This disallows all search engines and robots from crawling the listed directories and the one listed page. Everything else on the site stays crawlable.
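Disallow values are matched against the start of the requested path. The following sketch, again using Python's standard urllib.robotparser and made-up example.com URLs, shows which paths the rules above block:

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /privatedir/
Disallow: /tutorials/blank.htm
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

def crawlable(path: str) -> bool:
    """True if a generic crawler may fetch the given path."""
    # example.com and "ExampleBot" are placeholders for illustration.
    return parser.can_fetch("ExampleBot", "https://example.com" + path)

print(crawlable("/cgi-bin/search.cgi"))   # False: inside /cgi-bin/
print(crawlable("/privatedir/notes.txt")) # False: inside /privatedir/
print(crawlable("/tutorials/blank.htm"))  # False: the listed page itself
print(crawlable("/tutorials/index.htm"))  # True: no rule matches
```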
3)
User-agent: Googlebot-Image
Disallow: /
If you do not want Google’s image bot to crawl your site’s images and make them searchable online, or you want to save bandwidth, the above declaration will do it.
4)
User-agent: *
Disallow: /
User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
This is interesting: here we declare that crawlers in general should not crawl any part of our site, EXCEPT for Google, which is allowed to crawl the entire site apart from /cgi-bin/ and /privatedir/. A crawler follows only the User-agent group that matches it most specifically, so Googlebot ignores the “*” group entirely. The rules of specificity apply, not inheritance.
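The group selection can be checked with Python's standard urllib.robotparser. This is a sketch only: the standard-library parser has its own matching logic that may differ from a real crawler in edge cases, and OtherBot and example.com are placeholders.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /privatedir/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

site = "https://example.com"  # hypothetical host

# Googlebot uses its own group and never inherits "Disallow: /".
print(parser.can_fetch("Googlebot", site + "/index.html"))      # True
print(parser.can_fetch("Googlebot", site + "/cgi-bin/form.pl")) # False

# Any other crawler falls back to the "*" group.
print(parser.can_fetch("OtherBot", site + "/index.html"))       # False
```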
5)
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
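This is the preferred, more explicit way to disallow all crawlers from your site EXCEPT Google. Allow is an extension to the original robots.txt convention, but Googlebot honors it.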
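If you maintain rules for several crawlers, the file can be generated rather than typed by hand. A minimal sketch, assuming a hypothetical helper build_robots_txt that takes one list of (directive, path) pairs per user agent:

```python
def build_robots_txt(groups: dict[str, list[tuple[str, str]]]) -> str:
    """Render robots.txt text from {user_agent: [(directive, path), ...]}."""
    blocks = []
    for agent, rules in groups.items():
        lines = [f"User-agent: {agent}"]
        lines += [f"{directive}: {path}" for directive, path in rules]
        blocks.append("\n".join(lines))
    # A blank line between groups keeps the file easy to read.
    return "\n\n".join(blocks) + "\n"

# Reproduces the "everyone out except Google" file shown above.
text = build_robots_txt({
    "*": [("Disallow", "/")],
    "Googlebot": [("Allow", "/")],
})
print(text)
```

Save the resulting text as “robots.txt” in the root directory of the site.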
An alternative to “robots.txt” is the robots meta tag, which is useful when your web host prohibits you from uploading “robots.txt” to the root directory, or when you simply wish to restrict crawlers from a few select pages on your site.
The “robots” meta tag should be added inside the HEAD section of the page in question, like this:
<meta name="robots" content="noindex,nofollow" />
Some examples:
1)
<meta name="robots" content="noindex,nofollow" />
This disallows both indexing and following of links by a crawler on that specific page.
2)
<meta name="robots" content="noindex,follow" />
This stops all robots from indexing the page, but still lets them follow the links on it.
3)
<meta name="robots" content="index,nofollow" />
This allows indexing of the page, but instructs the crawler not to follow outgoing links.
4)
<meta name="robots" content="none">
This is shorthand for “noindex,nofollow”: do not index the page and do not follow its links.
5)
<meta name="googlebot" content="noindex,follow" />
This prevents only Google’s bots from indexing the page, while still letting them follow its links. Other robots remain free to index it.
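To confirm which directives a page actually carries, the meta tags can be read back with Python's standard html.parser module. This is a rough sketch: the class RobotsMetaParser and the function may_index are invented for illustration, and the check covers only the noindex and none values discussed above.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> style tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lowercased meta name -> set of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        if name in ("robots", "googlebot"):
            content = attr.get("content") or ""
            parts = {p.strip().lower() for p in content.split(",") if p.strip()}
            self.directives.setdefault(name, set()).update(parts)

def may_index(html: str, agent: str = "robots") -> bool:
    """True unless a tag addressed to all robots or to agent blocks indexing."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives.get("robots", set()) | parser.directives.get(agent.lower(), set())
    return not ({"noindex", "none"} & found)

page = '<html><head><meta name="googlebot" content="noindex,follow" /></head></html>'
print(may_index(page, "googlebot"))  # False: blocked for Google only
print(may_index(page, "otherbot"))   # True: other robots may index
```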