Robots.txt

Author: Frank
What is the robots.txt file for? When a search engine spider visits your site, many of them first check whether a robots.txt file is present. If it is, and it contains rules that apply to the particular bot/spider visiting, the bot will follow those rules. The rules that go in the file are exclusionary - that is to say, they may stop the bot from indexing some of your content, but they cannot direct the bot any further than that.
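To see the file from a crawler's point of view, here is a minimal sketch using Python's standard urllib.robotparser module; the site www.example.com and the user-agent name MyCrawler are just placeholders, not anything from this article.

```python
from urllib import robotparser

# Hypothetical site; a well-behaved crawler fetches this file before crawling.
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()  # downloads and parses the robots.txt file

# The parser can only answer "may I fetch this URL?" -- it cannot send the
# crawler anywhere else, which is what "exclusionary" means in practice.
print(rp.can_fetch("MyCrawler", "http://www.example.com/private/report.html"))
```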
Why is it good to have a robots.txt file? If you have no areas on your site that you want to exclude from search engine indexes, then a robots.txt file isn't strictly necessary. However, when a bot comes to your site and requests a robots.txt file that isn't there, it gets a 404 error (header response) in return, and that error shows up in your logs. So if you don't want to clutter your logs with errors, just create an empty robots.txt file.
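If you want to see exactly what a bot gets back, a quick status check is enough; a minimal sketch, assuming Python's urllib and a placeholder domain:

```python
import urllib.request
import urllib.error

# Placeholder domain -- substitute your own site.
url = "http://www.example.com/robots.txt"
try:
    with urllib.request.urlopen(url) as response:
        print(response.status)  # 200 even for an empty file, so no 404s cluttering the logs
except urllib.error.HTTPError as e:
    print(e.code)  # 404 here means every bot request is adding an error to your logs
```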
Where does the robots.txt file go? In the root directory of your site and nowhere else. There should be only ONE robots.txt file per domain; if you put robots.txt files in the subfolders of your site, they will never be seen, end of story.
**What to put in the file?**
To exclude all robots from the entire site:
User-agent: *
Disallow: /
To allow all robots complete access, either have an empty file or use:
User-agent: *
Disallow:
Exclude all bots from certain parts of the site:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Exclude a single bot:
User-agent: BadBot
Disallow: /
Allow a single bot access to your site (and exclude all others):
User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /
You can also disallow individual files:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
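Before uploading rules like the ones above, you can sanity-check them locally. Here is a minimal sketch with Python's urllib.robotparser, feeding it the "allow a single bot" example; the paths and the BadBot comparison are only illustrative.

```python
from urllib import robotparser

# The "allow a single bot" example from above, given to the parser as plain lines.
rules = """
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
""".strip().splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("WebCrawler", "/index.html"))  # True  -- its own record allows everything
print(rp.can_fetch("BadBot", "/index.html"))      # False -- falls back to the * record
```

The same approach works for any of the rule sets above: parse the lines, then ask can_fetch() for the user-agent and path you care about.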
Note: If you are using mod_rewrite together with a robots.txt file, compliant robots will check the robots file first. So if the whole site is excluded in the robots file, and mod_rewrite is doing a 301 redirect to another site, the robots will obey the Disallow rule in the robots file and, in theory, never see the redirect.
Robots files can get a little tricky, especially if you are using mod_rewrite to generate essentially 'virtual' directories and you want to stop the indexing of certain directories that don't really exist. I'm still trying to get to the bottom of this little intricacy. Basically, you can put an entry in for the 'virtual' directory, but it isn't going to stop the bots from indexing the final destination URL (the mod_rewritten one). In that case, if there are no links to the final destination URL and it is only ever reached as the result of another URL being rewritten, then I think you should be fine just excluding the first URL.
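As a rough sketch of the note above (with made-up domains): because a compliant bot consults robots.txt before making any request, a site-wide Disallow means the URLs that mod_rewrite would 301-redirect are never requested, so the redirect is never discovered.

```python
from urllib import robotparser

# Suppose www.olddomain.com (hypothetical) 301-redirects every URL to a new site
# via mod_rewrite, but its robots.txt also excludes the whole site.
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /",
])

# A compliant bot checks this *before* making the request...
if not rp.can_fetch("AnyBot", "http://www.olddomain.com/some-page.html"):
    # ...so it never issues the request and never sees the 301 response.
    print("URL is disallowed; the redirect is never seen")
```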
Further reading:
- Exclusion protocol (with examples)
- Robots Meta Tag
- Free Robots.txt creator