Images with Some Common File Names on Web Sites

Timothy C. Craven
Faculty of Information and Media Studies
The University of Western Ontario,
London, Ontario N6A 5B7
Canada
519-661-2111 ext. 88497. Fax: 519-661-3506.
craven@uwo.ca

Unpublished paper, 2005

Abstract

Google's advanced image search was used to find Web pages with img references to GIF files with names equal to the 26 letters of the alphabet, common shape designations ("arrow", "ball", "bullet", "dot", and "star"), and two functional designations ("new" and "visit"). References to letter-named files were slightly more likely to have alt attributes if the images were of the letters (16.3% of cases) than if they were not. References to shape-named files were slightly less likely to have alt attributes if the images were of the abstract shapes (19.2% of cases); the proportion also differed significantly among shapes. Files of abstract shapes were significantly shorter than other identically named files. Functionally-named files tended to be shorter if they contained the names (44.7% for "new" and 47.0% for "visit") than if they did not or if they contained text that did not incorporate the names. References to functionally-named files were much more likely to have alt attributes if the files contained the names than if they did not (55.1% versus 38.8%). Limitations of the sampling method include Google's apparently tending to rank longer image files higher.

Introduction

This article reports on a further investigation into how people textually summarize images on their Web pages; specifically, their use of the alt attribute of the img tag. Another article on this topic (Craven 2006) found that about 17% of Web pages had no img elements; that, on the remaining pages, 48-49% of img references had alt texts, of which 26-28% were null; and that 71-74% of identifiable images were GIF, 20-28% JPEG, and less than 1% PNG. GIF images were more commonly assigned null alt texts than JPEG images, and GIF files tended to be shorter than JPEG files, possibly because GIF files are more suited to decorative images. Weak positive correlations were observed between image file length and alt text length. Previous literature on the alt attribute was also summarized.

The research in this and the other article just noted represents an extension to a series of research reports on how people and organizations summarize Web pages, especially how they summarize their own Web pages in descriptions and keywords in meta tags (Craven 2004a; Craven 2004b; Craven 2004c; and relevant items cited therein), though also to some extent how they summarize external Web pages (Craven 2002).

The present article concentrates on img tags that refer to files with certain common names. Name repetition for different files on different sites, and even within the same site, is entirely to be expected. Types of names identified as commonly repeated in the research reported in the other article included letters of the alphabet, shapes commonly found in Web pages ("arrow", "ball", "bullet", "dot", "star", and the like), roles and directions ("back", "banner", "bottom", "logo", "new", "search", "top", "visit", "vote"), and indicators that the image is intended for formatting or tracking rather than for visibility ("blank", "clear", "clearpixel", "pixel", "shim", "spacer1", "track", "transparent").

One type of file named for a letter of the alphabet is clearly one consisting of an image of that letter, intended for use either as a decorative initial or otherwise. Files with role and direction names not infrequently contain images of those role and direction names. Thus, questions arise concerning the relationship between file names and text content of the corresponding images.

In related research, Kanungo, Lee, and Bradford (2002) looked at the relationship between text in images and text in the referencing HTML file as a whole, finding that 42% of sampled images contained text and that 59% of images with text contained at least one word that did not appear in the HTML file, while 36% of images with text contained only words also found in the HTML file.

For an image of text, equivalent, if not exactly identical, text has been recommended (Korpela 2005; Letourneau and Freed 2000). Thus, one would expect images containing text to have alt attributes regardless of correspondence to the name of the file.

If the image is an initial capital, it is recommended that the substitute text should just be the capital letter (Tobias 2004). If an image whose file name is a letter is, in fact, that letter capitalized and is used as an initial, one would therefore expect the img tag to have an alt attribute, whose value should, as stated, be the corresponding capital.

For an image representing punctuation, such as a bullet, either the punctuation (Korpela 2005; Idocs 2002; Tobias 2004) or an equivalent expression such as "item:" should be employed as the alt text (Korpela 2005), at least if not obtrusive (Flavell 2004), or even just a space (Flavell 2004) or an empty string (WatchFire 2005). Thus, one would expect that img tags referring to the common file name "bullet" would have alt attributes, but that these would be short, or even empty, if the image was designed as a punctuative bullet.

For an image of a symbol, the name of the symbol should be used as the alt text (Korpela 2005). Korpela (2005) and Idocs (2002) deprecate the use of ASCII art, such as "==>" for an arrow, although using a row of hyphens for a horizontal rule seems to be acceptable (Tobias 2004). One might expect that this advice would apply to the common file names "arrow", "ball", and "star", for example, at least when the files contained the abstract shapes. Research for the other article in this set identified no common file names for horizontal rules, though Paek and Smith (2003) used occurrences of the keywords "rule" and "line" in accompanying text to categorize image use as "decorative".

An empty alt text has been recommended for graphics included for spacing (U.S. Access Board 2004; Tobias 2004), purely decorative images (Korpela 2005; Idocs 2002; Bersvendsen 2004; Tobias 2004), mere illustrations, images in navigational links in which suitable text is already present (Korpela 2005), or "graphics which do not convey content" (WebAIM 2005). Thus, one might expect that img tags referencing common file names such as "blank" and "shim" would still have alt attributes, but that the alt text would typically be empty.

In practice, of course, the alt attribute is frequently omitted (Lopresti and Zhou 2000; Mukherjea, Hirata, and Hara 1999), and it is perhaps precisely in cases of images included for formatting and decoration, or which add nothing to content conveyed by the visible text, that page creators are most likely not to bother.

Hypotheses

The present study aimed, among other things, to test the following hypotheses about Web page images with common file names.
  1. References to files with names that are letters of the alphabet are more likely to use the alt attribute if the images are of the letters (the assumption being that a significant proportion of the images that are not the letters are intended to be decorative and so not to need alt text).
  2. One of two alternatives:
    1. References to files with names of shapes are more likely to use the alt attribute if the images are of the abstract shapes (on the grounds that such images serve as punctuation which is required for proper understanding of the text, while images of the real world are more likely to be seen as decorative).
      or
    2. References to files with names of shapes are less likely to use the alt attribute if the images are of the abstract shapes (on the grounds that such images serve as decoration which is not required for proper understanding of the text, while images of the real world are more likely to be seen as conveying substantive information).
  3. Files with names of shapes are smaller if they contain images of the abstract shapes (since abstract shapes are relatively simple, their files should be more compressible than, say, files containing pictures of physical objects).
  4. Files with common directional or role names are more likely to have alt attributes if they contain text (on the grounds that the text content is likely to be important to convey to all users) and especially if they contain text corresponding to their names.

Methodology

Google advanced image search was chosen to obtain a sample of image references in each category because Google's search capability appeared to produce more comprehensive results than did Yahoo!'s, as well as better precision.

Filtering was left at "moderate", mostly in order to avoid accessing too many sites that exhibited technically bad behaviour.

Three sets of searches were performed: 26 searches on letters of the alphabet; 5 searches on common shape names ("arrow", "ball", "bullet", "dot", "star"); and 2 searches on common role/direction names ("new", "visit").

To increase the number of items retrieved, "repeat the search with omitted results included" was selected, except in the case of the letter names, where this was done only if fewer than 25 valid examples were found on the first search.

Each search was on the required filename, with the extension .gif; for example, "a.gif". Any results that did not match exactly (for instance, "b-ball.gif" in response to "ball.gif") were excluded.
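
The exact-match rule can be illustrated with a short sketch. It is not the procedure actually used in the study (the exclusion may well have been done by hand); the function name `matches_exactly` and the sample URLs are hypothetical, and treating the comparison as case-sensitive is an assumption.

```python
from urllib.parse import urlparse
import posixpath


def matches_exactly(image_url: str, required: str) -> bool:
    """Return True if the last path segment of image_url equals `required`."""
    path = urlparse(image_url).path
    basename = posixpath.basename(path)
    # Assumption: the comparison is case-sensitive and ignores query strings.
    return basename == required


# Hypothetical search results for the query "ball.gif".
results = [
    "http://example.com/images/ball.gif",
    "http://example.com/images/b-ball.gif",
]
kept = [url for url in results if matches_exactly(url, "ball.gif")]
```

Comparing the whole last path segment, rather than testing for a substring, is what excludes names such as "b-ball.gif".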

For each image file selected, the approximate size was recorded as given in kilobytes by Google.

Letter files were categorized as showing either the letter, a corresponding glyph in a different writing system (for example, semaphore), or other. Images that included text in addition to the letter were generally placed in the last of these categories, unless the additional text was extremely unobtrusive.

Shape files were categorized as showing either the abstract shape, an object specified by the name (in a drawing or photograph, for example of an archery contest for the name "arrow"), or other.

The "new" and "visit" files were categorized as showing either only the file name (possibly with some added punctuation), the file name with other text, other text without the filename, or no text. Any occurrences of the file name in text counted, whether as a separate word or as part of a longer word.

The Web page that included the image was accessed by following the appropriate link in the Google results. For most pages, Google first showed a frame-based preview page, from which a further link led to the original; the link from the results page led directly to the original page in some instances, however.

If a referencing page was unavailable, was clearly an automatically generated directory, or obviously did not use the image, the image was eliminated.

From the display of the original referencing page in the browser, the HTML source was called up and searched for the first img tag with src attribute containing the required file name with extension. If this operation failed (say because the image file had been renamed), the image was eliminated. If the search succeeded, the value of the alt attribute was copied, or "<<none>>" if no value was present.
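
The lookup just described was carried out by hand in the browser, but its logic can be sketched with Python's standard `html.parser` module. The class and function names below are hypothetical, and returning `None` to signal that the image would be eliminated is an assumption of this sketch.

```python
from html.parser import HTMLParser

NO_ALT = "<<none>>"  # marker recorded when the img tag has no alt value


class FirstImgAlt(HTMLParser):
    """Record the alt value of the first img whose src contains a file name."""

    def __init__(self, filename):
        super().__init__()
        self.filename = filename
        self.found = False
        self.alt = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if self.found or tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src") or ""
        if self.filename in src:  # literal "contains" test, as in the text
            self.found = True
            alt = attributes.get("alt")
            # A null alt text (alt="") is kept as an empty string.
            self.alt = NO_ALT if alt is None else alt


def alt_for_first_reference(html_source, filename):
    """Return the alt value, NO_ALT, or None if no matching img is found."""
    parser = FirstImgAlt(filename)
    parser.feed(html_source)
    parser.close()
    return parser.alt if parser.found else None
```

For example, `alt_for_first_reference(page_source, "visit.gif")` would give the value to record for a "visit" page. Because the test is a plain substring match, a search for "a.gif" would also match a src such as "data.gif"; a stricter comparison on the last path segment could be substituted if that matters.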

Because a single site might show a file for various different letters of the alphabet, the data set for the letter files was subsequently pruned automatically to select only a single random file reference from each server name.
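
A minimal sketch of this kind of pruning follows. The function name, the sample URLs, and the use of the URL's network location as the "server name" are assumptions; the program actually used for the study is not described.

```python
import random
from urllib.parse import urlparse


def prune_one_per_server(page_urls, seed=None):
    """Keep a single randomly chosen reference for each server name."""
    rng = random.Random(seed)  # a seed makes the selection repeatable
    by_server = {}
    for url in page_urls:
        # Assumption: the server name is the host part of the URL.
        server = urlparse(url).netloc.lower()
        by_server.setdefault(server, []).append(url)
    return {server: rng.choice(urls) for server, urls in by_server.items()}


# Hypothetical referencing pages; two share a server.
pages = [
    "http://www.example.com/alphabet/a.html",
    "http://www.example.com/alphabet/b.html",
    "http://example.org/initials.html",
]
sample = prune_one_per_server(pages, seed=1)
```

Grouping before choosing gives every reference on a server the same chance of selection, regardless of the order in which the references were collected.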

Results

A total of 1163 letter-named files were found, of which 189 (16.3%) showed the letter and 15 (1.3%) showed glyphs from other systems (three each from the Cyrillic alphabet and hand signs, two from Japanese syllabics, and one from each of Morse Code, the Greek alphabet, semaphore, the Glagolitic alphabet, the IPA system, hieroglyphics, and Tamil). Files showing the letters were significantly more likely to be labelled with alt attributes (36.5% versus 26.5%) (chi-squared significance = 0.0101):
                    Null alt   Other alt   No alt   Total
Showing letter            12          57      120     189
Not showing letter        60         198      716     974
Total                     72         255      836    1163
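
The significance figure can be checked against the table with a short computation. The sketch below assumes that the reported value is the p-value of a Pearson chi-squared test on the full two-by-three table (null alt, other alt, no alt), which the text does not state explicitly; the function names are hypothetical. Under that assumption the result comes out at roughly 0.010.

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df2(stat):
    """Upper-tail probability for exactly 2 degrees of freedom."""
    # With 2 degrees of freedom the survival function is exp(-x / 2).
    return math.exp(-stat / 2)


# Rows: showing letter, not showing letter.
# Columns: null alt, other alt, no alt.
letters = [[12, 57, 120], [60, 198, 716]]
stat = chi_squared(letters)
p = p_value_df2(stat)

# Share of references in each row with any alt attribute (null or other).
share_with_alt = [(row[0] + row[1]) / sum(row) for row in letters]
```

The closed-form p-value holds only for two degrees of freedom, which is what a two-by-three table gives; other table shapes would need a general chi-squared distribution function.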

Relatively few of the images showing letters were, in fact, used as initials; a common other use was as colouring-book pages designed to be printed out.

A total of 1511 shape-named files were found, of which 281 (18.6%) were images of the corresponding abstract shapes.

Files that were images of the abstract shapes were significantly less likely to be labelled with alt text (chi-squared significance = 0.0018), but the actual difference was quite small (32.0% versus 34.5%):
                Null alt   Other alt   No alt   Total
Abstract shape        29          61      191     281
Other                 60         364      806    1230
Total                 89         425      997    1511

In general, shape-named files showed significant differences by name in the extent to which they were assigned alt texts (chi-squared significance = 0.0004), ranging from 31.0% for "ball" to 40.6% for "star"; the difference in proportions was even greater when restricted to abstract-shape images, but was not statistically significant (chi-squared significance = 0.0619). The proportion of images of abstract shapes varied from 5.1% for "dot" to 28.6% for "arrow".

The median length of the shape-named files was 13 kilobytes. Short files (defined as those below the median) were significantly more likely to be images of the abstract shapes than were long files (those at or above the median) (31.1% versus 6.8%) (chi-squared significance < 0.0001).
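
The median split can be sketched as follows. The records shown are invented purely for illustration and are not the study's data; the function name and the representation of each file as a (size in kilobytes, is abstract shape) pair are likewise assumptions.

```python
from statistics import median


def abstract_share_by_length(records):
    """Split records at the median size and return the share of
    abstract-shape images among short and among long files."""
    cutoff = median(size for size, _ in records)
    # Short files are below the median; long files are at or above it.
    short = [is_abstract for size, is_abstract in records if size < cutoff]
    long_ = [is_abstract for size, is_abstract in records if size >= cutoff]

    def share(flags):
        return sum(flags) / len(flags) if flags else float("nan")

    return cutoff, share(short), share(long_)


# Invented example records: (size in kilobytes, image is an abstract shape).
records = [
    (1, True), (2, True), (5, False), (13, False),
    (20, False), (40, True), (90, False),
]
cutoff, short_share, long_share = abstract_share_by_length(records)
```

Placing files equal to the median in the "long" group follows the definition given in the text; with many files tied at the median, the two groups will not be of equal size.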

A total of 85 files were found with the name "new" and 377 with the name "visit". Overall, 5.8% contained images of the name only, 41.1% of the name with additional text, 31.8% of other text without the name, and 21.2% of no text.
         Name only   Text including name   Other text   No text   Total
"new"            4                    32           35        14      85
"visit"         23                   158          112        84     377
Total           27                   190          147        98     462

Images of text equal to or containing the file name were significantly more likely to be assigned alt text than other images (53.0% versus 38.8%) (chi-squared significance = 0.0022). These images were also marginally significantly more likely to be assigned alt text than images containing text not including the file name (chi-squared significance = 0.0115). There was not, however, a significant difference between images containing text without the file name and images containing no text.

In viewing the records for "visit", it was observed that file sizes appeared to decline with progression through the Google results listing. This declining trend is clearly visible in the following chart: one-kilobyte files do not turn up until about rank 270, after which they are frequent; conversely, files larger than 100 kilobytes are encountered from time to time up to about rank 250, after which they disappear entirely.

Discussion

The alt values for the letter files tended to be short. The longest was "[The Norwegian oil and gas partners consists of over 150 companies that offer competitive state of the art technologies, products and services.]" (123 characters).

A common form consisted of the file name and extension followed by a byte count. This has sometimes been recommended for larger files, and, in fact, many of the letter files with such alt values appear to have been quite large.

For the shape-named files, alt texts were again generally short. The longest was "[ If using a graphical browser, or if viewing images, you'd be viewing an animated gif of some balls doing a cascade now - you're probably glad you never bothered aren't you]". The filename-bytecount format was rare for "arrow" but common enough for "ball", "bullet", "dot", and "star"; a simple file name plus extension without the byte count was also found in a number of instances.

There were no really long alt texts for "new". The longest for "visit" was "j.gümbel TEXTIL-VERTRIEBS GMBH; Messerschmittstr. 22, Postfach 2629, 89216 Neu-Ulm, Telefon 0700/562645689, 0700/Jobmiloty, Telefax 0731/722069, E-Mail info@job-miloty.de", reproducing the text of a Visitenkarte (business card). The filename-plus-byte-count format was again fairly common.

That larger images should rank higher on average is not entirely surprising. If one assumes that users are more normally looking for pictures of objects, people, or scenes from the real world, then it is generally reasonable to assume that they will be more likely to be satisfied with more detailed images, such as will require more storage space in files. Links from other sites, which are commonly employed as one indicator in ranking Web objects for retrieval, seem more likely to be made to images that provide more detail.

A practical disadvantage for the present research is the likelihood of bias in the sampling method. Google cuts off search results at 500 items; links to items beyond that rank are not made available. It is probable that many of the files missed by the searches are small, simple images, which are thus under-represented in the results.

Conclusion

In this study, some of the expected relationships were indeed found to hold. Letter-named files were more likely to use the alt attribute if the images were of the letters. Shape-named files were smaller if they contained images of the abstract shapes. The second alternative for hypothesis 2 was supported, but the small practical difference suggests that the grounds proposed for the two alternatives are of approximately equal validity. For the role/directional names, the lack of a significant difference in alt text assignment between images containing text without the file name and images containing no text failed to support the suggested explanation.

References

Last updated January 25, 2008, by Tim Craven