LIS 525 - Robots

Some Categories of Bots

Telling Robots Not to Access Pages

You can include a robots.txt file on a server that you administer. This file, which follows a standard format, tells compliant robots not to visit parts of your site. For example, the robots.txt file on the main UWO server tells all robots that read it (except the UWO-InktomiSearch robot) not to visit /ccs/export/, /its/ftp/, /western/PeopleSoft/newsletter/, and a long list of other directories. This file begins as follows:
# robots.txt for

# Inktomi's web robot will obey the first record in the robots.txt file with a User-Agent containing "UWO-InktomiSearch".
# If there is no such record, It will obey the first entry with a User-Agent of "*".
# Because nothing is disallowed, everything is allowed

User-agent: UWO-InktomiSearch

# specifies that no robots should visit
# any URL starting with "/ccs/export/"

User-agent: *
Disallow: /its/ftp/
Disallow: /ccs/export/
Disallow: /www/Usage/
(Lines beginning with # are just comments.)
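As a rough illustration of how a compliant robot might interpret rules like these, the following Python sketch uses the standard library's urllib.robotparser module. The host name www.example.org and the agent name ExampleBot are placeholders. The empty Disallow line in the first record is an assumption, based on the comment that nothing is disallowed for that robot; it is not shown in the excerpt above.

```python
from urllib.robotparser import RobotFileParser

# Rules modelled on the excerpt above. The empty "Disallow:" line in the
# first record is an assumption (it means "nothing is disallowed").
RULES = """\
User-agent: UWO-InktomiSearch
Disallow:

User-agent: *
Disallow: /its/ftp/
Disallow: /ccs/export/
Disallow: /www/Usage/
"""

# Placeholder host; not the real server address.
BASE = "http://www.example.org"


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(RULES)


def allowed(agent: str, path: str) -> bool:
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(agent, BASE + path)


if __name__ == "__main__":
    for agent in ("ExampleBot", "UWO-InktomiSearch"):
        for path in ("/index.html", "/its/ftp/readme.txt"):
            print(agent, path, allowed(agent, path))
```

A robot that honours the file would make a check like allowed() before each request. Nothing forces a robot to do so, which is why the file affects only compliant robots.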

Perhaps one in five servers actually uses a robots.txt file (for example, the FIMS servers do not). If you do use a robots.txt file on your server, you are advised at least to keep robots out of any cgi directory.

Note that including a robots.txt file may cause filtering software to block access to your site, because the providers of the software are then not allowed to check the site's entire contents.

You can also include a "robots" meta tag in an individual HTML file at any level:

<meta name="robots" content="noindex">
<meta name="robots" content="nofollow">
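The first tag asks robots not to index the page; the second asks them not to follow its links. Fewer than 10% of home pages use this meta tag, even counting those that use it with the content values "index", "follow", or "all".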
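To show how a robot might read this tag, here is a minimal Python sketch built on the standard library's html.parser module. The class name RobotsMetaParser, the function robots_directives, and the sample page are hypothetical names chosen for illustration. It assumes that several values may be combined in one content attribute, separated by commas.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # Assumed format: comma-separated values, e.g. "noindex, nofollow".
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html: str) -> set:
    """Return the set of robots directives declared in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical sample page used only for demonstration.
page = (
    '<html><head><meta name="robots" content="noindex, nofollow">'
    "</head><body><a href='next.html'>next</a></body></html>"
)
directives = robots_directives(page)
may_index = "noindex" not in directives
may_follow = "nofollow" not in directives

if __name__ == "__main__":
    print(sorted(directives), may_index, may_follow)
```

A page with no robots meta tag yields an empty set, which this sketch treats as permission both to index the page and to follow its links.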

For More Information


Last updated October 29, 2007.
This page maintained by Prof. Tim Craven
E-mail (text/plain only):
Faculty of Information and Media Studies
University of Western Ontario,
London, Ontario
Canada, N6A 5B7