
:: Search Engine Optimization :: How The Robots Exclusion Protocol Works

Some of you may ask what it is and why we need it. In a nutshell, as its name implies, the Robots Exclusion Protocol is used by Webmasters and site owners to prevent search engine crawlers (or spiders) from indexing certain parts of their Web sites. There can be a number of reasons for this, such as sensitive corporate information, semi-confidential data, information that needs to stay private, or programs and scripts that should not be indexed.

A search engine crawler or spider is a Web “robot” and will normally follow the robots.txt file (Robots Exclusion Protocol) if it is present in the root directory of a website. The robots.txt exclusion protocol was proposed in 1994 and remains today the Internet’s standard for controlling how search engine spiders access a particular website.

The robots.txt file can be used to prevent access to certain parts of a web site but, if it is not correctly implemented, it can also prevent access to the whole site! On more than one occasion, I have found the robots.txt file to be the main reason why a site wasn't listed in certain search engines. If it isn't written correctly, it can cause all kinds of problems and, worst of all, you will probably never find out about it just by looking at your actual HTML code.
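To see how small the gap between a working file and a site-blocking one can be, here is a minimal Python sketch using the standard library's urllib.robotparser. The two robots.txt bodies, the example.com URLs, the "ExampleBot" agent name and the helper is_allowed are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that was meant to hide a single directory.
INTENDED = """\
User-agent: *
Disallow: /private_files/
"""

# The same file with the path cut short: "/" is a prefix of every URL path.
BROKEN = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if a robot that follows the protocol may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for name, body in (("intended", INTENDED), ("broken", BROKEN)):
        for url in ("http://example.com/index.html",
                    "http://example.com/private_files/report.html"):
            print(name, url, is_allowed(body, url))
```

Nothing in the site's HTML changes between the two cases, which is why this kind of mistake is easy to miss.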

When a client asks me to analyse a website that has been online for about a year and still isn’t listed in certain engines, the first place I look is the robots.txt file. Once I have corrected it for the client’s website and optimized the site’s most important keywords, the rankings will usually go up within the next thirty days or so.

As the name implies, the “Disallow” command in a robots.txt file instructs the search engine’s robots to “disallow reading”, but that certainly does not mean “disallow indexing”. In other words, a disallowed resource may be listed in a search engine’s index, even if the search engine follows the protocol. On the other hand, an allowed resource, such as many of the public (HTML) files of a website, can be prevented from being indexed if the robots.txt file isn’t carefully written for the search engines to understand.

The most obvious demonstration of this is the Google search engine. Google can add files to its index without reading them, merely by considering links to those files. In theory, Google can build an index of an entire Web site without ever visiting that site or ever retrieving its robots.txt file.

In so doing, it is not violating the robots.txt protocol, because it’s not reading any disallowed resources; it is simply reading other web sites' links to those resources, which Google constantly uses for its PageRank algorithm, among other things.
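The difference between reading and indexing can be sketched with a toy crawler in Python. Everything here is hypothetical: the two in-memory hosts, their URLs, the "ToyBot" agent name and the crawl function are stand-ins for illustration, not a description of how Google or any real engine is built. The crawler never fetches a disallowed page, yet the disallowed URL still lands in its index because another site links to it.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory web: URL -> outgoing links found on that page.
PAGES = {
    "http://partner.example/links.html": [
        "http://site.example/private_files/report.html",
    ],
    "http://site.example/private_files/report.html": [],
}

# Hypothetical robots.txt body for each host.
ROBOTS = {
    "partner.example": "User-agent: *\nDisallow:\n",
    "site.example": "User-agent: *\nDisallow: /private_files/\n",
}


def may_read(url: str, agent: str = "ToyBot") -> bool:
    """Check the host's robots.txt before reading a page."""
    parser = RobotFileParser()
    parser.parse(ROBOTS[urlparse(url).netloc].splitlines())
    return parser.can_fetch(agent, url)


def crawl(start_urls):
    """Return (index, fetched): every URL known, and the URLs actually read."""
    index = {}    # URL -> "read" or "link-only"
    fetched = []  # pages whose content was retrieved
    queue = list(start_urls)
    while queue:
        url = queue.pop(0)
        if url in index:
            continue
        if may_read(url):
            fetched.append(url)
            index[url] = "read"
            queue.extend(PAGES.get(url, []))
        else:
            # Known only through someone else's link; content never read.
            index[url] = "link-only"
    return index, fetched


if __name__ == "__main__":
    index, fetched = crawl(["http://partner.example/links.html"])
    for url, how in index.items():
        print(how, url)
```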

Contrary to popular belief, a website does not necessarily need to be “read” by a robot in order to be indexed. How, then, can the robots.txt file be used to prevent a search engine from listing a particular resource in its index? In practice, most search engines have placed their own interpretation on the robots.txt file, one that allows it to be used to keep disallowed resources out of their index.

Most search engines today interpret a resource being disallowed by the robots.txt file as meaning they should not add it to their index. Likewise, if it’s already in their index, placed there by previous crawling activity, they would normally remove it. This last point is important, and an example will illustrate that critical subject.

The inadequacies and limitations of the robots exclusion protocol are indicative of what could sometimes be a bigger problem. It is impossible to prevent any directly accessible resource on a site from being linked to by external sites, be they partner sites, affiliates, websites linked to competitors, or search engines.

Even where a robots.txt file exists, there is no legal or technical obligation to honor it, least of all for humans creating links, for whom the standard was not written. In itself, this may not seem a bad thing, but there are many instances when a site owner would rather exclude a particular page from the Web. In such cases, the robots.txt file will, to a certain degree, help the site owner achieve his or her goals.

Since most websites change often and new content is constantly created or updated, it is strongly recommended that the robots.txt file in your website be re-evaluated at least once a month. If necessary, it only takes a minute or two to edit this small file and make the changes required. Never assume that ‘it must be OK, so I don’t need to bother with it’. Take a few minutes and look at the way it’s written. Ask yourself these questions:

1. Did I add some sensitive files recently?
2. Are there new sections I don’t want indexed?
3. Is there a section I want indexed that isn’t?

As a rule of thumb, if you are adding a file or a group of files containing sensitive information that you don’t want indexed by the search engines, edit your robots.txt file before uploading those files to your server. Make sure you place them in a separate directory. You could name it private_files or private_content, and add each such directory to your robots exclusion file to prevent the spiders from indexing it.
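As a sketch of that workflow, the Python snippet below builds a robots.txt body that disallows the two directory names suggested above and checks it with the standard library's urllib.robotparser before anything is uploaded. The function name build_robots_txt, the example.com URLs and the file names are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Directory names taken from the text; adjust to your own site.
PRIVATE_DIRS = ["private_files", "private_content"]


def build_robots_txt(private_dirs):
    """Build a robots.txt body that disallows each directory for all robots."""
    lines = ["User-agent: *"]
    for name in private_dirs:
        # The trailing slash limits the rule to that directory's contents.
        lines.append(f"Disallow: /{name.strip('/')}/")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = build_robots_txt(PRIVATE_DIRS)
    print(body)

    # Sanity check before uploading: private pages blocked, public ones not.
    parser = RobotFileParser()
    parser.parse(body.splitlines())
    for url in ("http://example.com/index.html",
                "http://example.com/private_files/report.html",
                "http://example.com/private_content/notes.html"):
        print(url, parser.can_fetch("*", url))
```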

Also, if you have files in a separate directory that you do want indexed, and those public files have been on your server for more than a month and are still not indexed, have a look at your robots.txt file to make certain there are no errors in any of its commands.
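One way to run that check is to list the pages you expect to be public and ask a parser which of them your current rules block. The sketch below again uses Python's urllib.robotparser; the robots.txt body, the paths and the blocked_paths helper are hypothetical. It also shows a common slip: rules match by path prefix, so a Disallow line without a trailing slash can catch more than intended.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the author meant to block only /private_files/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private
"""

# Hypothetical pages that should stay visible to search engines.
EXPECTED_PUBLIC = [
    "/index.html",
    "/templates/page1.html",
    "/private-label-templates/index.html",
]


def blocked_paths(robots_txt, paths, agent="*", base="http://example.com"):
    """Return the paths that a protocol-following robot may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(agent, base + p)]


if __name__ == "__main__":
    for path in blocked_paths(ROBOTS_TXT, EXPECTED_PUBLIC):
        print("blocked but expected public:", path)
```

Here the third path is reported because it begins with "/private"; writing the rule as "Disallow: /private_files/" would leave it accessible.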


Serge Thibodeau 2003