Robots Text File.


The robots.txt file is a set of instructions for visiting robots, more commonly called spiders, that index the content of your web site pages. For those spiders that obey the file, it provides a map of what they can and cannot index. The file must reside in the root directory of your website. The URL (web address) of your robots.txt file should look like this:

http://www.yourSite.com/robots.txt
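
Because the file always sits at the root, its address can be worked out from any page address on the same site. The following is a minimal sketch using Python's standard library; the function name robots_txt_url and the sample page address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt address for the site that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yourSite.com/myStuff/page.htm"))
# prints http://www.yourSite.com/robots.txt
```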

A robots.txt file opened in Notepad might look like this:

[Screenshot: RobotsTextFileNotRec.gif]

This is a screenshot of an empty robots.txt file. Using an empty file is not recommended.

Definition of Robots.txt file:

User-agent: * The asterisk (*) or wildcard represents a special value that means any robot. This is the only one needed until you fully understand how to set up different User-agents.

Disallow: The Disallow: line without a / (forward slash) tells the spider that it can index the whole site.

An empty value indicates that all of the site can be retrieved. At least one Disallow field should be present in each record; to allow everything, leave its value empty, without the / (forward slash), as shown above.

An empty "robots.txt" file is treated as if it were not there: all robots will consider themselves at home.

The Disallow: line without the trailing slash (/) tells all robots to index everything. If you have a line that looks like this:

Disallow: /myStuff/

it tells the spider that it can’t index the contents of the /myStuff/ directory.
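
A Disallow value is compared against the start of the requested path. Here is a simplified sketch of that rule, assuming plain prefix matching only (real spiders may support more, such as wildcards); the function name is_disallowed is hypothetical.

```python
from typing import Iterable


def is_disallowed(path: str, disallow_values: Iterable[str]) -> bool:
    """Return True if any non-empty Disallow value is a prefix of path."""
    # An empty Disallow value blocks nothing, so it is skipped.
    return any(bool(value) and path.startswith(value) for value in disallow_values)


print(is_disallowed("/myStuff/notes.htm", ["/myStuff/"]))  # True
print(is_disallowed("/index.htm", ["/myStuff/"]))          # False
print(is_disallowed("/index.htm", [""]))                   # False
```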

To allow all robots complete access:
User-agent: *
Disallow:
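
One way to check how a rule-following robot would read this record is Python's built-in urllib.robotparser module. This is only a sketch: the helper name can_index and the robot name ExampleBot are assumptions made for illustration, and other parsers may behave differently.

```python
from urllib.robotparser import RobotFileParser


def can_index(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Parse robots.txt text and report whether agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


ALLOW_ALL = "User-agent: *\nDisallow:\n"
page = "http://www.yourSite.com/myStuff/page.htm"

print(can_index(ALLOW_ALL, page))  # True: the empty Disallow allows everything
print(can_index("", page))         # True: this parser treats an empty file as no rules
```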

Important Note: The above format is the common and accepted standard for allowing all spiders access to the site. It has been said, however, that having just User-agent: * and a Disallow: line without a trailing forward slash (in other words, an effectively empty robots.txt file) may cause some spiders to incorrectly interpret it as blocking all content.

Reports from 2003 onward suggest that, with Google, disallowing your CSS directory could flag the site for a manual review to see whether you are using CSS to deceive the indexing spiders. Therefore, it could be a good idea not to disallow your /css/ directory.

[Screenshot of robots.txt file] You’ll notice in this screenshot of the robots.txt file that I’ve disallowed robots from the myStuff and classes folders. I do not recommend an empty file.

To exclude all robots from the server:
User-agent: *
Disallow: /

To exclude all robots from parts of a server:
User-agent: *
Disallow: /myStuff/
Disallow: /images/
Disallow: /classes/
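
If you keep the list of blocked folders somewhere else, a record like the one above can be generated and then checked. The sketch below assumes a hypothetical helper, build_robots_txt, and a hypothetical robot name, ExampleBot; the check uses Python's urllib.robotparser.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_dirs, agent="*"):
    """Return robots.txt text with one record for agent."""
    lines = [f"User-agent: {agent}"]
    # With no folders given, emit an empty Disallow so the record is still valid.
    lines += [f"Disallow: {path}".rstrip() for path in (disallowed_dirs or [""])]
    return "\n".join(lines) + "\n"


text = build_robots_txt(["/myStuff/", "/images/", "/classes/"])
print(text)

parser = RobotFileParser()
parser.parse(text.splitlines())
print(parser.can_fetch("ExampleBot", "http://www.yourSite.com/images/logo.gif"))  # False
print(parser.can_fetch("ExampleBot", "http://www.yourSite.com/index.htm"))        # True
```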

To exclude a single robot from the server:
User-agent: robotName
Disallow: /

To exclude a single robot from parts of a server:
User-agent: robotName
Disallow: /myStuff/
Disallow: /images/
Disallow: /classes/
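
A record that names one robot applies only to that robot. The following sketch checks this with Python's urllib.robotparser; otherRobot is a made-up name standing in for any robot the record does not mention.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: robotName
Disallow: /myStuff/
Disallow: /images/
Disallow: /classes/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

blocked_page = "http://www.yourSite.com/myStuff/page.htm"
open_page = "http://www.yourSite.com/index.htm"

print(parser.can_fetch("robotName", blocked_page))   # False: named robot, listed folder
print(parser.can_fetch("robotName", open_page))      # True: folder is not listed
print(parser.can_fetch("otherRobot", blocked_page))  # True: record does not apply
```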

If you want to Disallow: a particular file within the directory, your Disallow: line might look like this one:
Disallow: /myStuff/top_secret.htm

Keep in mind that the above example excludes only the specified page, top_secret.htm; it does not exclude the entire /myStuff/ directory.
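
The difference between blocking one file and blocking its folder can be checked with Python's urllib.robotparser. In this sketch the robot name ExampleBot and the second file name, other.htm, are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = "User-agent: *\nDisallow: /myStuff/top_secret.htm\n"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

secret = "http://www.yourSite.com/myStuff/top_secret.htm"
neighbour = "http://www.yourSite.com/myStuff/other.htm"

print(parser.can_fetch("ExampleBot", secret))     # False: the file itself is excluded
print(parser.can_fetch("ExampleBot", neighbour))  # True: the rest of /myStuff/ is not
```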

If you have files that you do not want indexed, put them in a folder and Disallow: the entire directory, or put them in a password-protected directory. Best of all, don’t put them on the web!

Remember: The robots.txt file always resides at the root level of your website.

What if you can’t make a robots.txt file?

Sometimes you cannot make a robots.txt file, because you don’t administer the entire server. All is not lost: there is a new standard for using HTML META tags to keep robots out of your documents.

The solution is to include a META tag like:
<meta name="robots" content="noindex"> The robot will retrieve the document but will not index it.
<meta name="robots" content="nofollow"> The robot will not follow any links on the page to other documents.
<meta name="robots" content="noarchive"> Most search engines maintain a cache of the documents they fetch. If you do not want a document from your site archived, place this tag in the head of the document, and search engines should not provide an archived copy of it.

The "robots" tag is obeyed by many different web robots. If you’d like to apply some of these restrictions only to Google’s robot, called "googlebot", you may use "googlebot" in place of "robots". You can also combine any or all of these values into a single meta tag. For example:
<meta name="googlebot" content="noarchive, nofollow">
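
To see how a robot might read these tags, here is a rough sketch built on Python's html.parser module. The class RobotsMetaParser and the function robots_directives are hypothetical names; the sketch collects the values from the general "robots" tag and, if asked, from one robot-specific tag. Real robots may apply their own rules.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from robots META tags in an HTML page."""

    def __init__(self, agent="robots"):
        super().__init__()
        # Always honour the general "robots" tag, plus one named robot.
        self.names = {"robots", agent.lower()}
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() in self.names:
            content = attr.get("content") or ""
            # Values are comma separated, e.g. "noarchive, nofollow".
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html, agent="robots"):
    parser = RobotsMetaParser(agent)
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="googlebot" content="noarchive, nofollow"></head></html>'
print(sorted(robots_directives(page, agent="googlebot")))  # ['noarchive', 'nofollow']
print(sorted(robots_directives(page)))                      # []
```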

Here are a few online references for more information on robots:

  1. Web Robots Developers Google.
  2. Robots.txt Specifications
  3. Robots-Tag HTTP header specifications.

CU
You have a good day
May the Higher Power of your choice bless you and yours.
Gary.

