LIS 525 - Logging

A log file is an automatically updated file that lists actions that have occurred. A Web server maintains a log file listing every request made to it. Additional information about requesters may be logged by means of cookies.

Since log files can be very long and mix information about different requesters, it is useful to have some log file analysis tools, which can summarize such things as site performance, visitor origins and return rates, and navigation patterns.

Some Web hosting services (6775 in a search on HostIndex.com, October 2007) allow clients to access their own log files. Many provide access to output from log file analysis tools.

Some Things To Do With Log Files

  1. Look for incomplete hits by noting the bytes transferred.
  2. Look for visitors who never click past your homepage or never get to some page that you consider a target.
  3. Look for visitors repeatedly entering your site on pages other than your homepage (you may want to add more keywords or meta tags to the home page).
  4. Look for "file not found" errors (files may be missing or links or search engine entries may need updating).
  5. See whether visitors are using browsers that support features on your site.
  6. If you have an on-site search engine, look for patterns of keywords that visitors use.
  7. Check for evidence of inappropriate behavior by content providers (such as using small images in spam HTML mail to gather recipients' IP numbers surreptitiously).
  8. See how your visitors are finding your site (if you have referrer logging enabled, and with the caveat that this information may be faked by spammers).
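
As a small illustration of item 4, here is a minimal Python sketch that tallies "file not found" responses by requested path. It assumes a log in the Common (or Combined) Log Format described in the next section, and the file name access_log is an assumption:

    import re
    from collections import Counter

    # The request is the first double-quoted field in each log line,
    # and the three-digit status code follows it.
    pattern = re.compile(r'"(?:\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) ')

    not_found = Counter()
    with open("access_log") as log:   # the file name is an assumption
        for line in log:
            m = pattern.search(line)
            if m and m.group("status") == "404":
                not_found[m.group("path")] += 1

    # The ten most often requested missing files.
    for path, count in not_found.most_common(10):
        print(count, path)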

Log Files In Apache

By default, log files in Apache are in Common Log Format. This format contains a separate line for each request, composed of several values separated by spaces, in the form
host ident authuser date request status bytes
A missing value is represented by a hyphen (-). The values are as follows:
host      the host and domain name or IP number of the client
ident     identity information reported by the client, if this is enabled on both client and server
authuser  the userid, if a password-protected document is requested
date      the date and time of the request
request   the client's request line, in double quotes (")
status    the three-digit status code returned to the client
bytes     the number of bytes in the object returned to the client, not including headers
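
As a rough illustration of how such a record can be taken apart programmatically (the regular expression below is an illustration, not part of Apache), the following Python fragment splits a line into its seven fields:

    import re

    # host ident authuser [date] "request" status bytes
    CLF = re.compile(
        r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
        r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
        r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    )

    line = ('205.152.129.34 - - [16/Jan/2005:06:03:16 -0500] '
            '"GET /~craven/525prx.htm HTTP/1.0" 200 2412')
    record = CLF.match(line).groupdict()
    print(record["host"], record["status"], record["bytes"])
    # prints: 205.152.129.34 200 2412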

Various custom log formats can also be defined.

A common alternative is Combined Log Format, which adds two more quoted values to each line: the referrer and the browser (user agent) identification.
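
In the Apache configuration file, these formats are typically defined with LogFormat directives and selected with a CustomLog directive. The stock definitions look like the following (the log file path is illustrative):

    LogFormat "%h %l %u %t \"%r\" %>s %b" common
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" combined
    CustomLog logs/access_log combined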

Here is an example of a few lines from a Combined Log Format file (reformatted as paragraphs for easier viewing):

205.152.129.34 - - [16/Jan/2005:06:03:16 -0500] "GET /~craven/525prx.htm HTTP/1.0" 200 2412 "http://search.yahoo.com/search?p=proxy+lis&sm=Yahoo%21+Search&fr=FP-tab-web-t&toggle=1&ei=UTF-8" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

205.152.129.34 - - [16/Jan/2005:06:03:17 -0500] "GET /~craven/525s.jpg HTTP/1.0" 200 740 "http://525.fims.uwo.ca/~craven/525prx.htm" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

68.142.250.47 - - [16/Jan/2005:06:05:16 -0500] "GET /robots.txt HTTP/1.0" 404 1046 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

68.142.251.42 - - [16/Jan/2005:06:05:23 -0500] "GET /~craven/525est.htm HTTP/1.0" 200 4142 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"

IP numbers are recorded and not resolved into host and domain names, and the ident and authuser values are generally missing. Note the "404" (file not found) error in the third entry, in response to a spider's request for the site's robots.txt file, which does not exist (the spider in this case is Slurp, Yahoo!'s Web robot, originally developed by Inktomi). The other responses are all "200" (OK), for files that could be delivered with no problems. There is no referrer information in the third and fourth entries; in the first entry, we can see the Yahoo! query that led to the page requested; in the second entry, the referrer is an HTML page that requires the requested JPEG file in order to display completely.
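
Referrer URLs like the Yahoo! one above can also be decoded programmatically. Here is a minimal Python sketch, assuming (as in Yahoo!'s URLs) that the search terms travel in the p parameter:

    from urllib.parse import urlparse, parse_qs

    # The referrer from the first sample entry above.
    referrer = ("http://search.yahoo.com/search?p=proxy+lis"
                "&sm=Yahoo%21+Search&fr=FP-tab-web-t&toggle=1&ei=UTF-8")

    # parse_qs decodes the percent-escapes and plus signs.
    query = parse_qs(urlparse(referrer).query)
    print(query["p"])   # prints: ['proxy lis']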

A copy of a sample log file is available on the course SharePoint site at http://faculty.fims.uwo.ca/craven/lis525/Shared%20Documents/access_log.

There is a Linux logresolve utility that will attempt to translate IP numbers in a log file into host and domain names. This utility was used to create the sample resolved log file at http://faculty.fims.uwo.ca/craven/lis525/Shared%20Documents/access_log_r (using the command /usr/bin/logresolve <access_log >access_log_r). For example, resolving the log shows that the first two requests in the sample above came from ns2.co.escambia.fl.us. In this case, looking up the IP number with ARIN is actually more informative, telling us that it belongs to BellSouth.net, Inc., in Atlanta.
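
The same kind of lookup can be approximated in a few lines of Python (a sketch only; among other things, logresolve caches its lookups, which this does not):

    import socket

    def resolve(ip):
        """Reverse DNS lookup, leaving the IP number as-is on failure,
        as logresolve does."""
        try:
            return socket.gethostbyaddr(ip)[0]
        except OSError:
            return ip

    print(resolve("205.152.129.34"))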

Since log files continue to grow, they need to be pruned, deleted, or rotated periodically. Apache has a program rotatelogs that rotates the log file without having to restart the server.
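
In the Apache configuration, rotatelogs is normally attached through a piped log. For example, a directive like the following (the paths are illustrative) starts a new log file every 86400 seconds, that is, daily:

    CustomLog "|/usr/local/apache/bin/rotatelogs /var/log/apache/access_log 86400" common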

Using Analog to Analyze a Log File

Analog is a free log analyzer, with versions that run under a variety of operating systems, including Windows and Linux. You can download Analog from one of many mirror sites. To install it, just extract the contents of the downloaded archive to a new directory. Read the introduction in how-to/startwin/index.html and/or the guide in docs/Readme.html. For practical use, you will need to edit the analog.cfg configuration file, at least by changing the HOSTNAME and HOSTURL commands; if you want to keep Analog's sample log file logfile.log, you should also change the LOGFILE command to specify the log file that you want to analyze instead. Run analog.exe from Windows or from the Command (DOS) prompt. To view the report, open report.html in a browser.
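
A minimal analog.cfg for the sample data might contain lines like the following (the values shown are only illustrative):

    HOSTNAME "LIS 525"
    HOSTURL http://525.fims.uwo.ca/
    LOGFILE access_log_r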

One section that is included in the report by default, the Monthly Report, is actually of no use for the sample log files, because each file covers only a few days. The Domain and Organization reports are likewise useless if the program is applied to the raw log file, which contains only IP numbers, not domain names.

Using Webalizer

Webalizer is another free package for log file analysis. For this program, you specify parameters after webalizer.exe on the command line; for example,
webalizer.exe -n 525.fims.uwo.ca -o u:\ u:\access_log_r
Here the -n option gives the host name to use in the report and -o gives the output directory. Webalizer will create a file index.html and some other files (whose names will all contain the word usage) in the directory specified (the root directory of the U drive, in the example above). It will also create, or update, a history file, webalizer.hist.


Last updated October 31, 2007.
This page maintained by Prof. Tim Craven
E-mail (text/plain only): craven@uwo.ca
Faculty of Information and Media Studies
University of Western Ontario,
London, Ontario
Canada, N6A 5B7