Timothy C. Craven
Faculty of Information and Media Studies
The University of Western Ontario,
London, Ontario N6A 5B7
519-661-2111 ext. 88497. Fax: 519-661-3506.
Unpublished paper, 2005
The research in this and the other article just noted represents an extension to a series of research reports on how people and organizations summarize Web pages, especially how they summarize their own Web pages in descriptions and keywords in meta tags (Craven 2004a; Craven 2004b; Craven 2004c; and relevant items cited therein), though also to some extent how they summarize external Web pages (Craven 2002).
The present article concentrates on img tags that refer to files with certain common names. Name repetition for different files on different sites, and even within the same site, is entirely to be expected. Types of names identified as commonly repeated in the research reported in the other article included letters of the alphabet, shapes commonly found in Web pages ("arrow", "ball", "bullet", "dot", "star", and the like), roles and directions ("back", "banner", "bottom", "logo", "new", "search", "top", "visit", "vote"), and indicators that the image is intended for formatting or tracking rather than for visibility ("blank", "clear", "clearpixel", "pixel", "shim", "spacer1", "track", "transparent").
One type of file named for a letter of the alphabet is clearly one consisting of an image of that letter, intended for use either as a decorative initial or otherwise. Files with role and direction names not infrequently contain images of those role and direction names. Thus, questions arise concerning the relationship between file names and text content of the corresponding images.
In related research, Kanungo, Lee, and Bradford (2002) looked at the relationship between text in images and text in the referencing HTML file as a whole, finding that 42% of sampled images contained text and that 59% of images with text contained at least one word that did not appear in the HTML file, while 36% of images with text contained only words also found in the HTML file.
For an image of text, equivalent, if not exactly identical, text has been recommended (Korpela 2005; Letourneau and Freed 2000). Thus, one would expect images containing text to have alt attributes regardless of correspondence to the name of the file.
If the image is an initial capital, it is recommended that the substitute text should just be the capital letter (Tobias 2004). If an image whose file name is a letter is, in fact, that letter capitalized and is used as an initial, one would therefore expect the img tag to have an alt attribute, whose value should, as stated, be the corresponding capital.
For an image representing punctuation, such as a bullet, either the punctuation (Korpela 2005; Idocs 2002; Tobias 2004) or an equivalent expression such as "item:" should be employed as the alt text (Korpela 2005), at least if not obtrusive (Flavell 2004), or even just a space (Flavell 2004) or an empty string (WatchFire 2005). Thus, one would expect that img tags referring to the common file name "bullet" would have alt attributes, but that these would be short, or even empty, if the image was designed as a punctuative bullet.
For an image of a symbol, the name of the symbol should be used as the alt text (Korpela 2005). Korpela (2005) and Idocs (2002) deprecate the use of ASCII art, such as "==>" for an arrow, although using a row of hyphens for a horizontal rule seems to be acceptable (Tobias 2004). One might expect that this advice would apply to the common file names "arrow", "ball", and "star", for example, at least when the files contained the abstract shapes. Research for the other article in this set identified no common file names for horizontal rules, though Paek and Smith (2003) used occurrences of the keywords "rule" and "line" in accompanying text to categorize image use as "decorative".
An empty alt text has been recommended for graphics included for spacing (U.S. Access Board 2004; Tobias 2004), purely decorative images (Korpela 2005; Idocs 2002; Bersvendsen 2004; Tobias 2004), mere illustrations, images in navigational links in which suitable text is already present (Korpela 2005), or "graphics which do not convey content" (WebAIM 2005). Thus, one might expect that img tags referencing common file names such as "blank" and "shim" would still have alt attributes, but that the alt text would typically be empty.
In practice, of course, the alt attribute is frequently omitted (Lopresti and Zhou 2000; Mukherjea, Hirata, and Hara 1999), and it is perhaps precisely in cases of images included for formatting and decoration, or which add nothing to content conveyed by the visible text, that page creators are most likely not to bother.
Filtering was left at "moderate", mostly in order to avoid accessing too many sites that exhibited technical bad behaviour.
Three sets of searches were performed: 26 searches on letters of the alphabet; 5 searches on common shape names ("arrow", "ball", "bullet", "dot", "star"); and 2 searches on common role/direction names ("new", "visit").
To increase the number of items retrieved, "repeat the search with omitted results included" was selected, except in the case of the letter names, where this was done only if fewer than 25 valid examples were found on the first search.
Each search was on the required filename, with the extension .gif; for example, "a.gif". Any results that did not match exactly (for instance, "b-ball.gif" in response to "ball.gif") were excluded.
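The exact-match rule can be illustrated with a short Python sketch. The function name `matches_exactly` and the case-insensitive comparison are assumptions made for illustration; the article does not say how case was treated or whether the check was automated.

```python
import posixpath
from urllib.parse import urlsplit


def matches_exactly(image_url, required_name):
    """Return True if the last path segment of image_url is exactly required_name.

    Assumption: case is ignored, so "Ball.GIF" counts as "ball.gif".
    """
    # Take only the path part of the URL, then its final segment.
    file_name = posixpath.basename(urlsplit(image_url).path)
    return file_name.lower() == required_name.lower()


results = [
    "http://www.example.org/images/ball.gif",
    "http://www.example.org/images/b-ball.gif",
]
kept = [url for url in results if matches_exactly(url, "ball.gif")]
```

Here `kept` retains only the first URL, since "b-ball.gif" merely contains the required name.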
For each image file selected, the approximate size was recorded as given in kilobytes by Google.
Letter files were categorized as showing either the letter, a corresponding glyph in a different writing system (for example, semaphore), or other. Images that included text in addition to the letter were generally placed in the last of these categories, unless the additional text was extremely unobtrusive.
Shape files were categorized as showing either the abstract shape, an object specified by the name (in a drawing or photograph, for example of an archery contest for the name "arrow"), or other.
The "new" and "visit" files were categorized as showing either only the file name (possibly with some added punctuation), the file name with other text, other text without the filename, or no text. Any occurrences of the file name in text counted, whether as a separate word or as part of a longer word.
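The four-way categorization can be sketched in Python as follows, assuming that the text visible in each image has already been transcribed by hand into a string. The function name `categorize_text` and the treatment of everything non-alphanumeric as "added punctuation" are assumptions for illustration only.

```python
def categorize_text(image_text, file_name):
    """Assign a transcribed image text to one of the four categories.

    Assumption: matching is case-insensitive, and any characters other than
    letters and digits count as "added punctuation".
    """
    lowered = image_text.strip().lower()
    name = file_name.lower()
    if not lowered:
        return "no text"
    # Substring match: the name counts even inside a longer word.
    if name not in lowered:
        return "other text"
    # Drop punctuation and spaces; if only the name remains, it is "name only".
    alphanumeric = "".join(ch for ch in lowered if ch.isalnum())
    if alphanumeric == name:
        return "name only"
    return "text including name"
```

For example, under these assumptions "New!" would fall under "name only", while "Newsletter" would fall under "text including name" because the name occurs as part of a longer word.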
The Web page that included the image was accessed by following the appropriate link in the Google results. For most pages, Google first showed a frame-based preview page, from which a further link led to the original; in some instances, however, the link from the results page led directly to the original page.
If a referencing page was unavailable, was clearly an automatically generated directory, or obviously did not use the image, the image was eliminated.
From the display of the original referencing page in the browser, the HTML source was called up and searched for the first img tag with src attribute containing the required file name with extension. If this operation failed (say because the image file had been renamed), the image was eliminated. If the search succeeded, the value of the alt attribute was copied, or "<<none>>" if no value was present.
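The lookup just described was carried out by hand in the browser, but it can be approximated with Python's standard `html.parser` module. This is a minimal sketch: the names `FirstImgAlt` and `first_alt` are hypothetical, and treating a valueless `alt` attribute as an empty string is an assumption.

```python
from html.parser import HTMLParser

NO_ALT = "<<none>>"  # marker used when the img tag has no alt attribute


class FirstImgAlt(HTMLParser):
    """Record the alt value of the first img tag whose src contains a file name."""

    def __init__(self, file_name):
        super().__init__()
        self.file_name = file_name.lower()
        self.found = False
        self.alt = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if self.found or tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        if self.file_name in src.lower():
            self.found = True
            value = attributes.get("alt", NO_ALT)
            # Assumption: a bare "alt" with no value is treated as empty.
            self.alt = "" if value is None else value


def first_alt(html_source, file_name):
    """Return the alt text, "<<none>>" if absent, or None if no img tag matches."""
    parser = FirstImgAlt(file_name)
    parser.feed(html_source)
    parser.close()
    return parser.alt if parser.found else None
```

A return value of `None` corresponds to the case in which the image was eliminated; an empty string corresponds to a null alt, which is kept distinct from a missing one.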
Because a single site might show a file for various different letters of the alphabet, the data set for the letter files was subsequently pruned automatically to select only a single random file reference from each server name.
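A pruning step of this kind might look like the following Python sketch. The record layout (a dictionary with a `page_url` key) and the function name `prune_by_server` are assumptions; the article does not describe the program actually used.

```python
import random
from urllib.parse import urlsplit


def prune_by_server(records, rng=None):
    """Keep one randomly chosen record per server name.

    Assumption: each record is a dict whose "page_url" value is the URL of
    the referencing page; the server name is that URL's host part.
    """
    if rng is None:
        rng = random.Random()
    by_server = {}
    for record in records:
        host = urlsplit(record["page_url"]).hostname or ""
        by_server.setdefault(host, []).append(record)
    # One random pick from each server's group of records.
    return [rng.choice(group) for group in by_server.values()]


sample = [
    {"page_url": "http://www.example.org/abc/a.html", "file": "a.gif"},
    {"page_url": "http://www.example.org/abc/b.html", "file": "b.gif"},
    {"page_url": "http://other.example.net/index.html", "file": "c.gif"},
]
pruned = prune_by_server(sample, random.Random(0))
```

Passing a seeded `random.Random` instance, as in the usage above, makes the selection repeatable.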
| ||Null alt||Other alt||No alt||Total|
|Not showing letter||60||198||716||974|
Relatively few of the images showing letters were, in fact, used as initials; a common other use was as colouring-book pages designed to be printed out.
A total of 1511 shape-named files were found, of which 281 (18.6%) were images of the corresponding abstract shapes.
Files that were images of the abstract shapes were significantly less likely to be labelled with alt text (chi-squared significance = 0.0018), but the actual difference was quite small (32.0% versus 34.5%):
|Null alt||Other alt||No alt||Total|
In general, shape-named files showed significant differences by name in the extent to which they were assigned alt texts (chi-squared significance = 0.0004), ranging from 31.0% for "ball" to 40.6% for "star"; the difference in proportions was even greater when restricted to abstract-shape images, but was not statistically significant (chi-squared significance = 0.0619). The proportion of images of abstract shapes varied from 5.1% for "dot" to 28.6% for "arrow".
The median length of the shape-named files was 13 kilobytes. Short files (defined as those below the median) were significantly more likely to be images of the abstract shapes than were long files (those at or above the median) (31.1% versus 6.8%) (chi-squared significance < 0.0001):
A total of 85 files were found with the name "new" and 377 with the name "visit". Overall, 5.8% contained images of the name only, 41.1% of the name with additional text, 31.8% of other text without the name, and 21.2% of no text.
|Name only||Text including name||Other text||No text||Total|
Images of text equal to or containing the file name were significantly more likely to be assigned alt text than other images (53.0% versus 38.8%) (chi-squared significance = 0.0022). These images were also marginally significantly more likely to be assigned alt text than images containing text not including the file name (chi-squared significance = 0.0115). There was not, however, a significant difference between images containing text without the file name and images containing no text.
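For readers who wish to reproduce comparisons of this kind, the following Python sketch computes a Pearson chi-squared statistic and its significance for a contingency table. The article does not state the table layouts or the software used, so the function names and the counts in the usage example are purely hypothetical. The significance is computed only for one or two degrees of freedom, where simple closed forms exist.

```python
import math


def chi_squared(table):
    """Return (statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    degrees = (len(table) - 1) * (len(column_totals) - 1)
    return statistic, degrees


def significance(statistic, degrees):
    """Upper-tail probability of the chi-squared distribution (df 1 or 2 only)."""
    if degrees == 1:
        return math.erfc(math.sqrt(statistic / 2))
    if degrees == 2:
        return math.exp(-statistic / 2)
    raise ValueError("closed form implemented only for 1 or 2 degrees of freedom")


# Hypothetical counts: rows are image categories, columns are
# "alt text assigned" and "alt text not assigned".
hypothetical = [[10, 20], [20, 10]]
stat, df = chi_squared(hypothetical)
p = significance(stat, df)
```

A 2 x 3 table (for instance, with columns for null alt, other alt, and no alt) has two degrees of freedom and is handled by the second closed form.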
In viewing the records for "visit", it was observed that file sizes appeared to decline with progression through the Google results listing. This declining trend is clearly visible in the following chart: one-kilobyte files do not turn up until about rank 270, after which they are frequent; conversely, files larger than 100 kilobytes are encountered from time to time up to about rank 250, after which they disappear entirely.
A common form consisted of the file name and extension followed by a byte count. This has sometimes been recommended for larger files, and, in fact, many of the letter files with such alt values appear to have been quite large.
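Alt values of this form could be detected with a regular expression along the following lines. The precise punctuation used in the observed alt texts is not given in the article, so the pattern below (file name, optional bracket or separator, digits, optional word "bytes") and the name `parse_filename_bytecount` are assumptions for illustration.

```python
import re

# Assumed shape: "name.ext", optional separator, a byte count, optional "bytes".
FILENAME_BYTECOUNT = re.compile(
    r"^\s*(?P<name>[\w-]+\.(?:gif|jpe?g|png))"  # file name with extension
    r"\s*[\(\[\-:,]?\s*"                        # optional opening separator
    r"(?P<bytes>\d[\d,]*)"                      # byte count, commas allowed
    r"\s*(?:bytes?|b)?\s*[\)\]]?\s*$",          # optional unit and closer
    re.IGNORECASE,
)


def parse_filename_bytecount(alt_text):
    """Return (file name, byte count) if alt_text has the assumed form, else None."""
    match = FILENAME_BYTECOUNT.match(alt_text)
    if match is None:
        return None
    return match.group("name"), int(match.group("bytes").replace(",", ""))
```

A bare file name with no byte count does not match this pattern and would need to be counted separately.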
For the shape-named files, alt texts were again generally short. The longest was "[ If using a graphical browser, or if viewing images, you'd be viewing an animated gif of some balls doing a cascade now - you're probably glad you never bothered aren't you]". The filename-bytecount format was rare for "arrow" but common enough for "ball", "bullet", "dot", and "star"; a simple file name plus extension without the byte count was also found in a number of instances.
There were no really long alt texts for "new". The longest for "visit" was "j.gümbel TEXTIL-VERTRIEBS GMBH; Messerschmittstr. 22, Postfach 2629, 89216 Neu-Ulm, Telefon 0700/562645689, 0700/Jobmiloty, Telefax 0731/722069, E-Mail email@example.com" reproducing the text of a Visitenkarte (business card). The filename-plus-byte-count format was again fairly common.
That larger images should rank higher on average is not entirely surprising. If one assumes that users are usually looking for pictures of objects, people, or scenes from the real world, then it is reasonable to assume that they will be more likely to be satisfied with more detailed images, which require more storage space in files. Links from other sites, which are commonly employed as one indicator in ranking Web objects for retrieval, seem more likely to be made to images that provide more detail.
A practical disadvantage for the present research is the likelihood of bias in the sampling method. Google cuts off search results at 500 items; links to items beyond that rank are not made available. It is probable that many of the files missed by the searches are small, simple images, which are thus under-represented in the results.