Preventing search engine indexing

July 4, 2006

There’s a lot of talk today about search engine optimization, i.e. trying to rank a page higher in search engines by using special techniques. I believe much of that effort is wasted. But it’s not all about getting people to find things on your site; sometimes you want certain things not to be found, and that is what this post is about.

There are three main ways of preventing search engines from indexing certain pages: robots.txt, the meta robots tag and the rel=”nofollow” attribute for links.

robots.txt

robots.txt is a small text file which you place in the server root, e.g. http://www.testdomain.com/robots.txt. Note that putting it in any directory other than the root won’t work; http://www.testdomain.com/folder/robots.txt will, in other words, not be detected by search engine spiders. robots.txt can contain a number of commands. The commands of interest for preventing indexing are User-agent, Disallow and Allow. Every robots.txt should start with a User-agent command, which decides which agent (in other words, which search engine spider) is affected by the commands that follow. Generally you want to use the same rules for all engines, which is represented by the command User-agent: *
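If you want different rules for a specific spider, you can give it its own User-agent section. Here is a small illustrative sketch (Googlebot is Google’s spider; the /drafts/ directory is just a made-up example):

User-agent: Googlebot
Disallow: /drafts/

User-agent: *
Disallow:

An empty Disallow line means nothing is disallowed, so in this sketch only Google is kept out of /drafts/ while all other spiders may crawl everything.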

Next up are the Allow and Disallow commands. URLs matched by an Allow command will be crawled and indexed, while those matched by a Disallow command will not. Here’s a simple example of a robots.txt file disallowing one directory.

User-agent: *
Disallow: /secretdir/

This will tell all robots to exclude http://www.yourdomain.com/secretdir/ and any files and subdirectories in that folder.

What about the Allow command then? The Allow command can be combined with Disallow to create custom rules. Say you want http://www.yourdomain.com/secretdir/publicpage.html indexed, but nothing else in the directory http://www.yourdomain.com/secretdir/. Then you would first have an Allow command allowing that particular page to be indexed, and then a Disallow command preventing the rest of the directory from being indexed.

User-agent: *
Allow: /secretdir/publicpage.html
Disallow: /secretdir/

Which touches on the subject of precedence: in which order are the rules applied? The RFC standard says that the first applicable rule should be used. Consider this example:

User-agent: *
Allow: /
Disallow: /secretdir/

According to the RFC standard this example would not do what the webmaster wanted, that is, allow everything except http://www.yourdomain.com/secretdir/. Following the RFC rules, everything, including http://www.yourdomain.com/secretdir/, would be indexed. This is because the first applicable rule would be applied, in this case Allow: /, since secretdir sits under /. However, Google has chosen to interpret the rules differently, applying the most specific rule first, in this case Disallow: /secretdir/, since it’s more specific than Allow: /. If you want to play it safe and make sure all search engines interpret the rules the way you intended, you should place the Disallow rule above the Allow rule in the example.
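Applied to the example above, that safer ordering would look like this (same rules, just reordered):

User-agent: *
Disallow: /secretdir/
Allow: /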

But since, as shown above, robots.txt files cannot be stored in subdirectories, this solution cannot be used on some cheap/free web hosts where you do not have access to the root of your own domain. Which leads us to:

Meta tags

Meta tags are tags containing metadata, that is, data about the document they are placed in. Meta tags can specify a lot of different information, but the tag of interest for this post is the robots tag. It can take on four values, which can be combined. This is an example of what the HTML code for the tag could look like:

<meta name="robots" value="noindex, nofollow" />

This code will tell the robot to neither index the page nor follow links from it.
The four values the content attribute can take on are {noindex, nofollow, index, follow}. The index value explicitly tells the robot to index the page it appears on. The follow value explicitly tells the robot to follow any outgoing links from the page it appears on. The no-prefixed values explicitly tell the robot not to do the respective thing.

Despite what some SEO freaks say, you typically never need to add the index value; the robot will index the page by default. The only exception might be when you have added noindex but still want links to be followed; in that case you might add follow to be explicit, though even then it shouldn’t be necessary.
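For completeness, that combination would be written like this:

<meta name="robots" content="noindex, follow" />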

So, the meta robots method gives control over how each HTML page is treated by search engine robots, but the last method offers even more fine-grained control.

rel=”nofollow” attribute for links

This last attribute lets you control individually, for every link, whether it should be followed by a search engine robot. You could use this attribute to prevent a certain page on your site from getting indexed, but that’s a bad way of doing it: as long as there’s another link pointing to that page it will be indexed anyway. Use one of the other methods instead to be certain.

The real beauty of marking links with rel=”nofollow” is that robots pretend the link doesn’t exist. Since Google and other search engines rank a page based on how many links point to it, this is an effective way of linking to a page without supporting it by increasing its rank.

<a href="http://www.thatpageidontlike.com" rel="nofollow">link me beautiful</a>

If you have thoughts or questions about this post, don’t hesitate to leave a comment below.

2 Responses to “Preventing search engine indexing”

  1. Kevin Says:

    Your info really help me to understand this…thanks

  2. Dane Kantner Says:

    I can think of a few more ways that some more sophisticated sites might be using, namely if they’re trying to hide the fact that they’re hiding something from the index.

    The IPs of many search engines are fairly easy to determine. If you know Googlebot’s IP ranges, for instance, you can serve up an altogether different experience to it (including dynamically altering the robots.txt file that those IP ranges see, dynamically adding nofollow attributes, or dynamically adding the noindex robots meta tag to the page header).

    Also, since this blog post was written a new method has been added… you can add a header to your HTTP response called the X-Robots-Tag…

    X-Robots-Tag: noindex

