LIS 525 - Robots
Some Categories of Bots
- Chatter bots "talk" with the user.
- Commerce bots perform commerce activities
on the World Wide Web and the Internet.
- Game bots help in playing computer games
or act as opponents in these games.
- Knowledge bots use artificial intelligence
to search for information.
- News bots create custom newspapers
or clipping services.
- Search bots search the Internet and World Wide Web
for information.
- Shopping bots do shopping and price comparison on the
World Wide Web.
- Spiders, Webcrawlers, or Web robots
automatically fetch Web pages referenced by other Web pages;
the pages are then fed, for example,
to search engines.
- Stock bots track stock-related information.
- Update bots monitor selected Web sites for any
updated materials.
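As a rough illustration of the spider entry in the list above, the core step of "fetching pages referenced by other pages" is pulling the links out of a page that has already been downloaded. The following is only a minimal sketch using Python's standard library; the class name LinkExtractor, the function extract_links, and the example.com addresses are all made up for this example, and a real spider would also download each page and consult robots.txt.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the links found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real spider would download this over HTTP.
sample_page = (
    '<p><a href="/about.html">About</a> '
    '<a href="http://example.org/">Elsewhere</a></p>'
)
found = extract_links("http://example.com/index.html", sample_page)
print(found)
```

Each URL in the resulting list would then be queued for fetching in turn, which is how a spider works its way from page to page.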
Telling Robots Not to Access Pages
You can include a robots.txt file on a server
that you administer;
this file, following a certain standard format,
tells compliant robots not to visit parts of your site.
For example, the robots.txt file
on the main UWO server (http://www.uwo.ca/robots.txt)
tells all robots that read it
(except the UWO-InktomiSearch robot)
not to visit /ccs/export/,
/its/ftp/,
/western/PeopleSoft/newsletter/,
and a long list of other directories.
This file begins as follows:
# robots.txt for http://www.uwo.ca/
#
# Inktomi's web robot will obey the first record in the robots.txt file with a User-Agent containing "UWO-InktomiSearch".
# If there is no such record, It will obey the first entry with a User-Agent of "*".
# Because nothing is disallowed, everything is allowed
User-agent: UWO-InktomiSearch
Disallow:
# specifies that no robots should visit
# any URL starting with "/ccs/export/"
User-agent: *
Disallow: /its/ftp/
Disallow: /ccs/export/
Disallow: /www/Usage/
...
(Lines beginning with # are just comments.)
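To see how a compliant robot might interpret these records, here is a minimal sketch using the urllib.robotparser module from Python's standard library. Only the records visible in the excerpt are used (the lines elided by "..." are not reproduced), and the robot name ExampleBot and the file names in the test URLs are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# The records shown in the excerpt above, without the comment lines.
EXCERPT = """\
User-agent: UWO-InktomiSearch
Disallow:
User-agent: *
Disallow: /its/ftp/
Disallow: /ccs/export/
Disallow: /www/Usage/
"""


def load_rules(text):
    """Parse robots.txt text and return a parser that can answer queries."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(EXCERPT)

# The named robot has an empty Disallow line, so nothing is off limits to it.
inktomi_ok = rules.can_fetch(
    "UWO-InktomiSearch", "http://www.uwo.ca/ccs/export/file.html")

# Any other robot (ExampleBot is a hypothetical name) falls under the "*" record.
other_blocked = rules.can_fetch(
    "ExampleBot", "http://www.uwo.ca/ccs/export/file.html")
other_ok = rules.can_fetch("ExampleBot", "http://www.uwo.ca/index.html")

print(inktomi_ok, other_blocked, other_ok)
```

Note that nothing enforces these answers: a robot that never reads robots.txt, or chooses to ignore it, can still request the pages.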
Perhaps one in five servers actually uses a
robots.txt file
(for example,
the FIMS servers www.fims.uwo.ca, faculty.fims.uwo.ca,
and intra.fims.uwo.ca do not).
If you do use a robots.txt file on your server,
you are advised at least
to keep robots out of any cgi directory.
Note that including a robots.txt file
may cause filtering software to block access to your site,
because the file keeps the software providers' robots from checking
the site's entire contents.
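The advice above about cgi directories could be followed with a very small robots.txt file. The sketch below builds one and checks it with Python's urllib.robotparser; the directory name /cgi-bin/, the function build_robots_txt, the robot name ExampleBot, and the example.com URLs are assumptions for illustration, so substitute whatever your own server actually uses.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_dirs):
    """Return robots.txt text asking all compliant robots to skip the given directories."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {directory}" for directory in disallowed_dirs]
    return "\n".join(lines) + "\n"


# "/cgi-bin/" is an assumed directory name, not taken from any particular server.
robots_txt = build_robots_txt(["/cgi-bin/"])
print(robots_txt)

# Check the generated rules the way a compliant robot would read them.
checker = RobotFileParser()
checker.parse(robots_txt.splitlines())
cgi_allowed = checker.can_fetch(
    "ExampleBot", "http://www.example.com/cgi-bin/search.cgi")
home_allowed = checker.can_fetch(
    "ExampleBot", "http://www.example.com/index.html")
```

The generated text would be saved as robots.txt at the top level of the server, which is where robots look for it.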
You can also include a "robots" meta tag
in an individual HTML file at any level:
<meta name="robots" content="noindex">
or
<meta name="robots" content="nofollow">
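As a sketch of how a robot might act on these tags, the following uses Python's standard html.parser module to read a page's "robots" meta tag and report whether the page may be indexed and whether its links may be followed. The names RobotsMetaParser and robots_meta_policy are made up for this example, and only the two values shown above ("noindex" and "nofollow") are interpreted; anything else is treated as permission.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of every <meta name="robots" content="..."> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        # The content may hold several comma-separated values.
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_meta_policy(html):
    """Return (may_index, may_follow) for one HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


page = '<html><head><meta name="robots" content="noindex"></head></html>'
print(robots_meta_policy(page))
```

A page with no "robots" meta tag at all comes out as fully permitted under this sketch, which matches the usual default for robots.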
Fewer than 10% of home pages use this meta tag,
even counting those that use it
with the content values "index", "follow", or
"all".
Last updated October 29, 2007.
This page maintained by
Prof. Tim Craven
E-mail (text/plain only): craven@uwo.ca
Faculty of Information and Media Studies
University of Western Ontario
London, Ontario
Canada, N6A 5B7