Advanced HTML Guide

robots.txt and sitemap files

Introduction

Every website that wants as many visitors as possible should use both robots.txt and sitemap files. They perform quite different functions but complement each other, which is why I've created a single page about both of them.

robots.txt

A robots.txt file is a text file in a simple format which gives information to web robots (such as search engine spiders) about which parts of your website they are and aren't allowed to visit.

If you don't have a robots.txt file, web robots will assume that they can go anywhere on your site. The simple robots.txt below allows robots access to every part of your site. The only advantage of having one of these 'allow all' files is that it stops 404 errors appearing in your log files when the spiders can't find your robots.txt.

User-agent: *
Disallow:

To use it, simply place this file at the root of your web server. So if your website is at http://www.advancedhtml.co.uk/ then the robots.txt must go at http://www.advancedhtml.co.uk/robots.txt.
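As a small illustration of the "root of the web server" rule, the sketch below works out where a robot would look for robots.txt given any page URL on a site. The function name robots_txt_url is my own invention for this example, and it relies only on Python's standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    robots_txt_url is a hypothetical helper written for this example.
    """
    parts = urlsplit(page_url)
    # Keep only the scheme and host: robots.txt always sits at the root path,
    # whatever directory the page itself lives in.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.advancedhtml.co.uk/tables.htm"))
# prints http://www.advancedhtml.co.uk/robots.txt
```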

If there are certain parts of your site that you don't want robots to visit, you can add a Disallow: line for each one. This will stop well-behaved robots from accessing the directories you specify. However, not all robots are well behaved, so don't rely on this as a method of stopping these directories from being indexed. If you don't want pages to be indexed then either don't put them on the web, or use a proper security scheme such as .htaccess password protection.

User-agent: *
Disallow: /data/
Disallow: /scripts/
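If you want to check how a well-behaved robot would read these rules, Python's standard urllib.robotparser module can parse them for you. The sketch below is only an illustration: the agent name "ExampleBot" and the path /data/report.csv are made up for the example, and real robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the example above.
RULES = """\
User-agent: *
Disallow: /data/
Disallow: /scripts/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(url: str, agent: str = "ExampleBot") -> bool:
    """Report whether the (hypothetical) agent may fetch url under RULES."""
    return parser.can_fetch(agent, url)


print(allowed("http://www.advancedhtml.co.uk/tables.htm"))       # True
print(allowed("http://www.advancedhtml.co.uk/data/report.csv"))  # False
```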

You can even disallow all robots from accessing anywhere on your site with this robots.txt.

User-agent: *
Disallow: /

The 'User-agent' command restricts the commands that follow it to a specific web robot. In my examples I'm using a '*' to apply the commands to all robots.
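To see how a named 'User-agent' section differs from the '*' section, here is a minimal sketch using Python's standard urllib.robotparser. The robot names "BadBot" and "OtherBot" are invented for the example; the rules shut BadBot out completely while every other robot is only kept out of /data/.

```python
from urllib.robotparser import RobotFileParser

# "BadBot" is a hypothetical robot name used only for illustration.
RULES = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /data/
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

HOME = "http://www.advancedhtml.co.uk/"

# BadBot matches its own section, which blocks the whole site.
print(parser.can_fetch("BadBot", HOME))                  # False
# Any other robot falls through to the '*' section.
print(parser.can_fetch("OtherBot", HOME))                # True
print(parser.can_fetch("OtherBot", HOME + "data/x.csv")) # False
```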

Sitemap linking

One final command, which relates to the next section of this page, is the 'SITEMAP' command. It tells search engines and other robots where your sitemap is located. For example, the complete robots.txt could look like this:

User-agent: *
Disallow:

SITEMAP: http://www.advancedhtml.co.uk/sitemap.txt
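The sketch below shows one way a robot might pick up the sitemap location from the file above. It assumes Python 3.8 or later, where urllib.robotparser gained the site_maps() method; it is an illustration of the idea, not a description of how any particular search engine does it.

```python
from urllib.robotparser import RobotFileParser

# The complete robots.txt from the example above.
RULES = """\
User-agent: *
Disallow:

SITEMAP: http://www.advancedhtml.co.uk/sitemap.txt
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# site_maps() returns a list of sitemap URLs, or None if there are none.
sitemaps = parser.site_maps()
print(sitemaps)
# prints ['http://www.advancedhtml.co.uk/sitemap.txt']
```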

Limitations

  1. robots.txt files are accessible to everyone, so don't use them as a form of security!
  2. Although robots are supposed to obey your robots.txt, not all of them do.

For more information on robots.txt files go to http://www.robotstxt.org/.


Sitemaps

Whereas robots.txt files are usually used to ask robots to avoid a particular part of your site, a sitemap is used to give the robot a list of pages that it is welcome to visit.

By giving the search engine a sitemap you can (hopefully) increase the number of pages that it indexes. As well as telling the search engine the URLs of your pages, the sitemap can also tell the robots when each page was last modified, the page's priority, and how often the page is likely to be updated.

Text format

There are two main sitemap formats. The simplest is a plain text file listing the full URLs of all your pages. The second is an XML file, which can provide a lot more information. For this site I use the text format. Here is a shortened version of what it looks like.

http://www.advancedhtml.co.uk/
http://www.advancedhtml.co.uk/advancedhtml.htm
http://www.advancedhtml.co.uk/addtosearchengine.htm
http://www.advancedhtml.co.uk/colours.htm
http://www.advancedhtml.co.uk/faq.htm
http://www.advancedhtml.co.uk/htaccess.htm
http://www.advancedhtml.co.uk/javascript.htm
http://www.advancedhtml.co.uk/making-money-from-your-web-site.htm
http://www.advancedhtml.co.uk/password.htm
http://www.advancedhtml.co.uk/tables.htm
http://www.advancedhtml.co.uk/webspace.htm

The file format shouldn't need much explanation. It is just a text file with one full URL per line. I save it as sitemap.txt and put it on my web server at http://www.advancedhtml.co.uk/sitemap.txt. As mentioned in the robots.txt section, I include a line in my robots.txt which points to this sitemap. This allows the search engines to find it more easily.
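If you keep a list of your page names somewhere, a text sitemap is easy to generate. The following is a minimal sketch under that assumption: BASE, PAGES and build_text_sitemap are names I've made up for the example, and the page list is a shortened copy of the one above.

```python
from typing import Iterable
from urllib.parse import urljoin

# Hypothetical inputs for the example: the site root and some page names.
BASE = "http://www.advancedhtml.co.uk/"
PAGES = ["", "advancedhtml.htm", "tables.htm", "colours.htm"]


def build_text_sitemap(base: str, pages: Iterable[str]) -> str:
    """Return one absolute URL per line, in the order the pages are given."""
    # urljoin turns each relative page name into a full URL;
    # an empty name yields the site root itself.
    return "\n".join(urljoin(base, page) for page in pages) + "\n"


sitemap = build_text_sitemap(BASE, PAGES)
print(sitemap, end="")
```

The resulting string can then be saved as sitemap.txt and uploaded to the root of the site.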

XML format

The XML version of the sitemap format looks like the example shown below. I would recommend that you generate XML sitemaps using a sitemap generation tool rather than trying to hand code them. Search for sitemap generation tools on Google. I used http://www.xml-sitemaps.com/ to create the snippet below.

<?xml version="1.0" encoding="UTF-8"?>
<urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
      http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
<url>
  <loc>http://www.advancedhtml.co.uk/</loc>
  <priority>1.00</priority>
  <changefreq>weekly</changefreq>
</url>
<url>
  <loc>http://www.advancedhtml.co.uk/advancedhtml.htm</loc>
  <priority>0.80</priority>
  <changefreq>weekly</changefreq>
</url>
<url>
  <loc>http://www.advancedhtml.co.uk/tables.htm</loc>
  <priority>0.80</priority>
  <changefreq>weekly</changefreq>
</url>
<url>
  <loc>http://www.advancedhtml.co.uk/colours.htm</loc>
  <priority>0.80</priority>
  <changefreq>weekly</changefreq>
</url>
</urlset>

You should call your XML sitemap 'sitemap.xml' and put it at the root of your web server, e.g. http://www.advancedhtml.co.uk/sitemap.xml.
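If you would rather script the XML than use an online generator, Python's standard xml.etree.ElementTree module can build it. This is only a sketch: ENTRIES and build_xml_sitemap are names invented for the example, it assumes Python 3.8 or later (for the xml_declaration argument), and it leaves out the optional xsi:schemaLocation attributes shown in the snippet above.

```python
import xml.etree.ElementTree as ET

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hypothetical input: (URL, change frequency, priority) for each page.
ENTRIES = [
    ("http://www.advancedhtml.co.uk/", "weekly", "1.00"),
    ("http://www.advancedhtml.co.uk/tables.htm", "weekly", "0.80"),
]


def build_xml_sitemap(entries) -> bytes:
    """Return a UTF-8 encoded XML sitemap for the given entries."""
    # Make the sitemap namespace the default one so tags have no prefix.
    # Note that register_namespace changes a module-wide setting.
    ET.register_namespace("", NS)
    urlset = ET.Element(f"{{{NS}}}urlset")
    for loc, changefreq, priority in entries:
        url = ET.SubElement(urlset, f"{{{NS}}}url")
        ET.SubElement(url, f"{{{NS}}}loc").text = loc
        ET.SubElement(url, f"{{{NS}}}changefreq").text = changefreq
        ET.SubElement(url, f"{{{NS}}}priority").text = priority
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


xml_bytes = build_xml_sitemap(ENTRIES)
print(xml_bytes.decode("utf-8"))
```

The bytes returned can be written straight to a file called sitemap.xml.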

Sitemap submission

If you add a reference to your sitemap to your robots.txt file then it should be found by the search engines automatically. However, you can take a more active role in the sitemap submission process by using tools from Google, Microsoft and Yahoo. You can read more about these tools on my Website Analytics page.

  1. Google Webmaster Tools (now called Google Search Console)
  2. Yahoo Site Explorer (since discontinued)
  3. Microsoft Webmaster Tools (now called Bing Webmaster Tools)

Conclusions

robots.txt and sitemap files serve different but complementary purposes. I highly recommend that you use both of them on your site to improve the coverage of your website in the major search engines.


Copyright © 1997 - 2024