LIS 504 - Noise and bias

Errors in making estimates about a population from sample data may be random, in which case they are called noise, or systematic, in which case they are called bias. The following example shows the two kinds of error.

Suppose you want a quick estimate of the average word length in a text. Average word length is often used as one measure of how difficult a text is to read, which can be an important consideration in advising readers.

For an electronic text, averaging the lengths of all the words would be quite easy with the right software. But you probably would not want to do this by hand for a text that was available only in printed form. So, instead, you might choose to sample the words.
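For the electronic case, a minimal sketch in Python might look like the following. The function name `average_word_length` and the rule for what counts as a word (a run of letters, with internal apostrophes ignored when counting) are assumptions made for illustration; other tokenization rules would give slightly different averages.

```python
import re


def average_word_length(text: str) -> float:
    """Return the mean number of letters per word in text."""
    # Assumed rule: a word is a run of letters, optionally joined by
    # apostrophes (so "don't" is one word).
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    if not words:
        raise ValueError("text contains no words")
    # Count letters only, so apostrophes do not inflate the length.
    return sum(len(w.replace("'", "")) for w in words) / len(words)


# 35 letters in 9 words, so about 3.89
print(average_word_length("The quick brown fox jumps over the lazy dog."))
```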

You might take a systematic sample of the first (or last) word in each paragraph or in each line. Systematic samples, however, can produce systematic error, or bias.

Here, for instance, is a list of the word lengths, line by line, of an actual paragraph of text:

2 12 2 3 5 2 8 7 4
12 3 1 4 3 4 2 11
9 2 7 3 10 7 8 4
2 6 2 5 1 8 13 2 3 7
2 10 8 2 3 3 5 6 7
7 3 10 7 4 7 1
12 8 7 4 9 3 8
2 13 2 7 9 2 13 3
14 8 7 14 2
8 5 8 3 10 2 11
9 12 7

The actual average word length works out to about 6.06 (491 letters in 81 words). The average length of the first words in the 11 lines, however, is about 7.18, which overestimates the true mean by about 1.12. Part of the reason for the difference is systematic error: the first words in lines tend to be longer than average simply because of the way word wrapping is carried out. It is easier to squeeze another short word onto the end of a line than another long one, so long words are more often pushed to the start of the next line.
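The two averages can be checked with a short Python sketch. The data are the word lengths listed above, one inner list per line of the paragraph; the variable names are chosen here for illustration.

```python
# Word lengths from the paragraph above, one inner list per printed line.
LINES = [
    [2, 12, 2, 3, 5, 2, 8, 7, 4],
    [12, 3, 1, 4, 3, 4, 2, 11],
    [9, 2, 7, 3, 10, 7, 8, 4],
    [2, 6, 2, 5, 1, 8, 13, 2, 3, 7],
    [2, 10, 8, 2, 3, 3, 5, 6, 7],
    [7, 3, 10, 7, 4, 7, 1],
    [12, 8, 7, 4, 9, 3, 8],
    [2, 13, 2, 7, 9, 2, 13, 3],
    [14, 8, 7, 14, 2],
    [8, 5, 8, 3, 10, 2, 11],
    [9, 12, 7],
]


def mean(values):
    return sum(values) / len(values)


all_lengths = [n for line in LINES for n in line]  # the whole population
first_words = [line[0] for line in LINES]          # the systematic sample

population_mean = mean(all_lengths)
first_word_mean = mean(first_words)

print(f"words in paragraph: {len(all_lengths)}")        # 81
print(f"actual mean length: {population_mean:.2f}")     # 6.06
print(f"first-word mean:    {first_word_mean:.2f}")     # 7.18
print(f"systematic error:   {first_word_mean - population_mean:+.2f}")  # +1.12
```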

Even a random sample may provide an estimate that is in error, though the error will be random. For example, using a random number table, we might select the following word lengths from the list above:

12 2 5 7 4 2 3 10 7 13 8

The mean of this random sample is about 6.64, which is still a bit higher than the actual mean of about 6.06, though closer to it than the 7.18 given by the systematic sample.
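As a quick check of the arithmetic, the sample mean and its error can be computed in Python. The population mean is taken as 491 / 81, the sum and count of all the word lengths in the paragraph; the variable names are illustrative only.

```python
# The eleven word lengths drawn with the random number table.
sample = [12, 2, 5, 7, 4, 2, 3, 10, 7, 13, 8]

# Sum (491) and count (81) of all word lengths in the paragraph.
population_mean = 491 / 81

sample_mean = sum(sample) / len(sample)
random_error = sample_mean - population_mean

print(f"sample mean:  {sample_mean:.2f}")    # 6.64
print(f"random error: {random_error:+.2f}")  # +0.57
```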

If you have JavaScript enabled, you can use the following form to simulate taking more random samples from the set of word lengths. Click on the "Sample" button to take a sample. You will notice that the random error is sometimes as large as, or larger than, the systematic error; more often, though, it is smaller. A fairly large random error is to be expected here because the sample is small (11 words out of 81).

Sampled lengths:
Mean length in sample:

Distribution of sample means:
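The sampling experiment in the form above can also be sketched in Python. This is not the page's own script, only an illustration under stated assumptions: each sample is 11 words drawn without replacement, the run uses 10,000 trials, and the seed, function name, and histogram layout are arbitrary choices.

```python
import random
from collections import Counter

# Word lengths from the paragraph, one inner list per printed line.
LINES = [
    [2, 12, 2, 3, 5, 2, 8, 7, 4],
    [12, 3, 1, 4, 3, 4, 2, 11],
    [9, 2, 7, 3, 10, 7, 8, 4],
    [2, 6, 2, 5, 1, 8, 13, 2, 3, 7],
    [2, 10, 8, 2, 3, 3, 5, 6, 7],
    [7, 3, 10, 7, 4, 7, 1],
    [12, 8, 7, 4, 9, 3, 8],
    [2, 13, 2, 7, 9, 2, 13, 3],
    [14, 8, 7, 14, 2],
    [8, 5, 8, 3, 10, 2, 11],
    [9, 12, 7],
]
LENGTHS = [n for line in LINES for n in line]
TRUE_MEAN = sum(LENGTHS) / len(LENGTHS)
# Systematic error of the first-word-in-each-line sample.
FIRST_WORD_ERROR = sum(line[0] for line in LINES) / len(LINES) - TRUE_MEAN


def sample_means(trials=10_000, sample_size=11, seed=504):
    """Return the means of `trials` random samples (assumed: no replacement)."""
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    return [
        sum(rng.sample(LENGTHS, sample_size)) / sample_size
        for _ in range(trials)
    ]


means = sample_means()

# How often is the random error at least as large as the systematic error?
share = sum(abs(m - TRUE_MEAN) >= abs(FIRST_WORD_ERROR) for m in means) / len(means)
print(f"random error >= systematic error in {share:.1%} of samples")

# Crude text histogram of sample means, rounded to the nearest whole number.
histogram = Counter(round(m) for m in means)
for value in sorted(histogram):
    print(f"{value:2d} {'#' * (histogram[value] * 60 // len(means))}")
```

With these settings the share should come out well under half, in line with the observation that the random error is usually, but not always, smaller than the systematic error. Increasing `sample_size` narrows the histogram, which is the effect of sample size on noise.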

Last updated November 1, 2000.
This page maintained by Prof. Tim Craven
E-mail (text/plain only): t.craven@uwo.ca
Faculty of Information and Media Studies
University of Western Ontario,
London, Ontario
Canada, N6A 5B7