{ 1}


People looking for information often cannot identify the specific data or combination of data that they want at the beginning of their searches. It may be that their needs are not yet fully defined and they simply want to browse. Alternatively, they may not understand a retrieval system sufficiently, or it may simply be impossible for their needs to be expressed in its terms. As a result they will not be able to translate their needs into the appropriate specifications and they must scan to find what they want. Whether browsing or scanning, searchers need a place to start and guidance on where they are and what choices are open to them. They need maps of the terrain. There are different techniques for guiding people in this kind of searching. This book is concerned with the technique called string indexing.


To understand why string indexing is a powerful aid, not only for searchers of printed tools, but also, potentially, in online searching of electronic databases, one needs to know what string indexing is. First, however, a brief discussion of indexes and index displays will be useful.

An index is a specific kind of tool for finding information. Whether an index is used by a human being or by a machine, its essence is a list of index entries. Each index entry refers to an indexed item somewhere outside the index; for instance, to a record in a database, to a folder in a file drawer, or to a book on a library shelf. The entries should be in some recognizable order, usually alphabetical. Typically, an entry consists mostly of text, although it may also contain various kinds of coding.

An index entry may be divided into a description and a locator. The loca{ 2}tor is the part which actually points to the indexed item. The description provides some information on what will be found in the indexed item, and how the description begins generally determines where the entry will be found in the index. Take, for example, the back-of-book index entry

Here the locator, "14", points to page 14. The description, "HITCHCOCK, ALFRED JOSEPH, GASTRONOMIC LIFE. POTATOES", refers to a passage on page 14 concerning the director's lifelong devotion to potatoes. It also places the entry in the "H" section of the index; more exactly, with the other entries beginning "HITCHCOCK, ALFRED JOSEPH".

Most usually, what is represented by the description is a topic or subject of the indexed item, as is "potatoes in the gastronomic life of Alfred Hitchcock" above. Quite often, however, the description corresponds to a class of items to which the indexed item belongs; for example, an indexed item might be assigned a description putting it in the class of "manuals on constructing 'OO' and 'HO' gage model railways".

The description in an index entry contains one or more terms. Ideally, terms are names of things or of classes of thing which are connected to the indexed item in significant ways; for example, "HITCHCOCK, ALFRED JOSEPH", "GASTRONOMIC LIFE", and "POTATOES". Hence, they should be nouns or noun phrases. What should conveniently be recognized as a term, however, will vary with the indexing system being discussed. Thus, adjectives qualifying the things or classes may be categorized as terms in some systems.

If terms are eliminated from the decription part of an index entry, what remains are connectives. Connectives serve to separate the terms in the description and also often indicate how the things represented by the terms are related. The periods in the sample index entry above are examples of connectives.

In addition to the index entries, an index may contain other elements, such as cross-references. A cross-reference is like an index entry, except that it points to another place or places within the index; e.g.

While cross-references are not essential, they may be very useful, especially in larger indexes.

Only part of an index is used in any one search. In the course of searching, both people and machines can employ the recognizable ordering of the list to eliminate many entries from any consideration at all. Even when an index entry is considered, not all of it may need to be examined, especially if its parts are themselves arranged in a recognizable order.

{ 3} An index display is a visual display of an index, or of part of an index, which can be used by people directly, rather than by machines. It may be embodied in handwriting, typewriting, print, printout, microfiche, or CRT screens. Index displays are typically at least partly precoordinate; that is, the descriptions in the index entries contain two or more terms with connectives such as prepositions, conjunctions, or punctuation marks between them. By contrast, computerized retrieval software frequently uses postcoordinate searching: it makes several separate searches in its indexes and then combines the lists of locators obtained. For example, a precoordinate index might provide an index entry

indicating the subject of the information stored at location "633.082"; postcoordinate searching, on the other hand, might involve finding index entries for "BEES", "CROPS", and "POLLINATION" individually and then noting that "633.082" is common to all three terms. Boolean retrieval systems usually operate by means of postcoordinate searching.


String indexing systems are mostly the product of two factors: 1. the availability of computers to assist in the production of index displays; and 2. dissatisfaction with the inefficiencies, tedium, and inconsistencies of manual methods.

The kinds of intellectual tasks with which computers can most readily assist are those repeated over and over and involving rules which can be defined briefly and exactly. It is precisely this kind of task that human beings often find tedious and perform inefficiently, and in performing which they make errors obvious to users of the product. Among the tasks of index display production that fit into this category and which have been assigned to the computer in much recent indexing work are: the sorting of index entries into a recognizable order; the insertion of cross-references from an approved list; and the formating of the index for display.

To the tasks commonly assigned to the computer in index display production, string indexing systems add one more: the generation of a number of index entries to give at least some of the same information about an indexed item at different places in the index. This generation of multiple overlapping entries becomes especially tedious for human beings as the degree of precoordination increases.

For purposes of this book, string indexing will be defined fairly broadly. A string index will be a form of index with two main characteristics: 1. each indexed item { 4} normally has a number of index entries containing at least some of the same terms; 2. computer software generates the description part of each index entry "according to regular and explicit syntactical rules" (Svenonius 1978). The description part of a string index entry will be called an index string; the computer software that produces it, an index string generator.

The use of the same, or nearly the same, terms in a number of index entries for an indexed item does not mean that the entries are nearly identical; nor does it mean that they will be close together in the index. On the contrary, different entries will normally begin with different terms and appear at quite different places.  For example, an item about "pollination of crops by bees" with locator "633.082" might have the string index entries:

  1. BEES. POLLINATION of CROPS : 633.082
  2. CROPS. POLLINATION by BEES : 633.082
  3. POLLINATION of CROPS by BEES : 633.082
The first of these index entries will appear in the "B" section of the index; the second, in the "C"'s; and the third, in the "P"'s.

String indexing systems, with their multiple overlapping entries, may be contrasted, on the one hand, with single-entry systems and, on the other hand, with systems with non-overlapping multiple entries. Single-entry systems display only one index entry for each indexed item, in one place in the index; e.g.,

Searchers who look up other terms applicable to the indexed item may be directed to the single index entry by cross-references; e.g.,
In systems with non-overlapping multiple entries, index entries for the indexed item appear at more than one place in the index but typically give much less information; e.g.,
  1. BEES : 633.082
  2. CROPS : 633.082
  3. POLLINATION : 633.082
In all present-day string indexing systems, the index string generator produces multiple index strings from a single source description of the indexed item. This single source description is a string of terms and connectives, often with additional codes, or symbols meaningful to the software but not in any { 5}ordinary human language. Such a descriptive string will be called an input string. The index string generator applies the syntactic rules to the information given in the input string to determine with which term to begin an index string, what terms should follow, and how they should be connected. For example, software for one string indexing system could apply its syntactic rules to the input string
to produce, with the locator "633.082", the multiple overlapping index entries already illustrated. The first term in an index string will be called the lead term; a term in the input string that becomes the lead term of one or more of the index strings will be called an access term.


Probably the key unanswered question about string indexing is how it can be applied to online searching (Travis and Fidel 1982). Specific answers to this question may become apparent in time, especially if more research than the almost negligible amount so far is directed at it. Meanwhile, some idea of the kind of answer that might be expected can be taken from recent concepts in software design.

A person using a computer system needs to have a way of picturing the system, the information that it contains, and what is done with that information. This picture that the user builds about the system has been referred to by software engineers as the "user illusion" (Kay 1984). User illusions are important because of the impalpable nature of almost everything that happens in a computer system: without such illusions, following and remembering what the system does becomes virtually impossible for the human mind.

According to the "what you see is what you get" principle, the image on the screen should always be a faithful representation of the user illusion. Older interactive software does not capitalize on user illusions to the extent that newer software does, should, and will.

Viewing a retrieval system as if it were a printed index is one kind of user illusion, and online index displays can capitalize on this illusion. Because printed indexes will continue to be familiar from the backs of books, users should readily adopt the printed index illusion for some time to come. Thus the design of online index displays seems likely to be a worthwhile undertaking.

{ 6} The printed index illusion is especially useful because it allows the computer system designer to pack large amounts of information into a small display space. In using such a high-information display, the searcher enjoys the benefit of decreased effort for two reasons. First, no commands need to be sent to the computer system as long as the display continues to be useful; thus, no effort needs to be expended in selecting or typing commands. Second, because the display can remain the same for some time, the effort required of the user in adjusting to the change from one display to another is also minimized. From the point of view of the computer system, the user's terminal needs to be served less often, although, when it is, the amount of data transmitted to the terminal will be greater.

Software which promotes productive user illusions is especially helpful to inexperienced or infrequent users. Specifically, online index displays are likely to be especially useful to end-users of information, in comparison to professional intermediaries. The further development of online index displays may thus have important implications both for the broad accessibility of information and for the future employment of information professionals.


Whether online or not, a great advantage which string index displays have over other kinds of index display is that of being able to be generated with a relatively small investment of input data: many index entries can be derived automatically from a single input string plus locator. Thus, the effort of human indexers and others involved at the input stage can be saved, as well as some costs of data storage and transmission.

As already suggested, string index displays have other advantages. To understand these advantages further, however, as well as why different string indexing systems have been designed, it will be helpful to have a general idea of the aims of string indexing in relation to searching.

In general, searches for information will be more effective the more relatively useful information and the less relatively useless information the searchers find. Searches in index displays will be more efficient the less effort searchers have to expend to achieve equal effectiveness. Search effectiveness is often measured by dividing indexed items into those relevant to a given query and those not relevant. Two common measures of search effectiveness that make use of this division are the recall and precision ratios. The recall ratio is the proportion of relevant documents retrieved among all relevant documents; the precision ratio, the proportion of relevant documents retrieved among all retrieved documents.

From the searchers' point of view, the main purpose of the multiple overlapping index entries and explicit syn{ 7}tactic rules characteristic of string indexes is to improve the efficiency and effectiveness of searches.

Multiplication of index entries is aimed at increasing efficiency and the recall side of effectiveness, by providing searchers with more direct access to possibly useful information. Overlap is aimed at increasing both sides of effectiveness, by increasing the detail in index strings. Increased detail is expected to improve precision because a searcher may incorrectly assume that an insufficiently described item might be worth retrieving; recall, because a searcher may incorrectly assume that an insufficiently described item is not worth retrieving.

The explicit syntactic rules are designed to increase recall and efficiency and, to some extent, precision. The rules may accomplish these purposes in a number of ways: by making it easier for a searcher to predict what forms index entries will take and hence to decide for what to look; by bringing similar index entries closer together and separating dissimilar entries; by making the index strings less ambiguous or more quickly understood; by making it easier for searchers to break off reading part way through index entries which do not fit their needs; by allowing searchers to skip to the information in an index string that is most crucial to their decisions.

The foregoing discussion suggests some desirable characteristics of index strings from the searcher's point of view:

  1. predictability;
  2. collocation;
  3. clarity;
  4. succinctness;
  5. eliminability.

Predictability refers to the extent to which the searcher can predict the forms that relevant index strings will take and, more especially, where they are likely to be found in the index display.

Collocation here means the placing of similar index strings together and the separation of dissimilar index strings. The assumption is that searchers who have found one item of possible interest will be able to find a second similar item of possible interest more efficiently if it is close to the first. Searchers may also be able to correct such slips of the eye as omitting to scan one of the subheadings under a heading. Collocation facilitates such correction because the eye of a searcher examining one index element may be drawn incidentally to adjacent elements.

Clarity refers to how likely it is that the index string will not be misinterpreted and to how readily searchers can correctly grasp the meaning of the index string.

Succinctness represents the ratio of detail or specificity to the length of the index string. Excessive succinctness may of course detract from clarity.

Eliminability means how quickly a searcher can break off examining the index string if it is irrelevant. Eliminability partly depends on succinct{ 8}ness and clarity, especially toward the beginning of the index string; but it may also derive from other factors, such as the marking of suitable points for breaking off.

A couple of examples can be cited here.

First, compare "4-H CLUBS" to "CLUBS, 4-H". The second index string provides better collocation, to the degree that items relating to 4-H Clubs are similar to items relating to other kinds of club. On the other hand, it may be slightly inferior in clarity to the first, because the order of its elements inverts the order found in ordinary language. "CLUBS, 4-H" should have better predictability than "4-H CLUBS" in one sense and worse predictability in another: on the one hand, searchers are more likely to know where to find index elements beginning with "C" than index elements beginning with "4" in an alphabetical sequence; on the other, searchers may expect to read words in their ordinary-language order.

As a second example, the index string

is superior to
in clarity and eliminability: it is not likely to be misunderstood, or even read past the second word, by searchers interested in lifting equipment. It also implies better collocation, by separating the dissimilar items on the two kinds of cranes. On the other hand, it suffers from somewhat less succinctness if searchers can deduce from "USE in CONTROL of FROGS" what kind of crane is meant.


This section will discuss in somewhat more detail some of the aspects of searching string index displays already mentioned. While this discussion will provide useful background ideas, the remainder of the book may be understood without it.

1.5.1 Modes of search

A person searching an index display will do so in roughly three basic modes: 1. finding a sequence of entries which looks likely to { 9} contain useful information; 2. looking at the more prominent features of such entries, such as their initial words, in order to select those entries which still seem likely to be useful; 3. examining selected entries in detail in order to make final decisions on which indexed items to retrieve.  Modes of this sort have been given the labels "seeking", "scanning", and "screening" (Keen 1977c).

Different qualities of index strings tend to aid to different extents in different modes of searching. For example, predictability is especially important in the first mode; on the other hand, collocation and eliminability are more important in the second mode. Different parts of the index string tend to be used in different modes also; the first part, for instance, is usually most employed in the first mode. Hence, in evaluating an index string, different weight should be given to different qualities depending on the part of the string being considered. Many string indexing systems have index strings whose qualities differ quite properly from one part to another. Different syntactic rules may, in fact, be invoked depending on which part of the index string is being generated.

Parallel to the three modes of search, a useful indexed item may be missed broadly at three different stages: 1.  the searcher may never even see the index entry; 2. the searcher may begin to examine the index entry but break off and eliminate it prematurely; 3. the searcher may read the entire index entry but fail to follow it up. At any stage, a miss may occur because the searcher has: run out of time; been distracted by something else; misunderstood the meaning of information presented; or made an incorrect assumption about information which is not present.

1.5.2 Some factors affecting efficiency

Two things which waste searcher effort and so decrease efficiency are redundant information and irrelevant information. Redundant information is information which the searcher already has; irrelevant information is information which the searcher does not particularly need.

The main problem of redundant information in index displays occurs when searchers encounter entries for some indexed items that they have already found earlier in their searches. Especially vexing to searchers and especially avoided in many string indexing systems is the occurrence of two entries for the same indexed item close together in the display. Usually, the proximity of the entries would be the result of their both beginning in the same way; e.g.,


{ 10} The effect perceived by searchers has been referred to as "stuttering" (Lay 1973). Thus, a feature of string indexing systems which is generally good from the point of view of efficiency, namely the multiplication of index entries, is restricted because it might in extreme cases actually decrease efficiency.

Another kind of redundant information is information which searchers generally have before they begin their searches and of which they do not feel a need to be reminded. This kind of redundancy is very much dependent on the searcher. For instance, a searcher looking for information on the DIRAC computer language might already be familiar enough with it to know that it was used in searching on-line information retrieval systems. This searcher might thus find redundant most of the information in the index string

Someone else searching the same index might, however, find the additional information on DIRAC quite useful.

Irrelevant information is encountered by searchers principally when they read index entries for items which do not fit their needs. The sooner they can discontinue reading such index entries, the less irrelevant information they will encounter. An irrelevant index entry may nevertheless encourage a searcher to read on for a number of reasons: because the first part is not comprehensible, especially because the relationships between the concepts expressed are not clear; because the first part is understood incorrectly and thus appears to be leading in a suitable direction; because the first part, even when understood correctly, could belong to a relevant index entry, but happens not to do so; because the content is sufficiently interesting to distract the searcher's interest; or because the natural human tendency to read on to the end of a phrase or a sentence in ordinary language makes breaking off difficult.

Efficiency is also degraded to some extent by increased index bulk. Here again, the multiplication of index entries, designed to increase efficiency, may in extreme cases have the reverse effect.

1.5.3 Relationships between efficiency, recall, and precision

In considering the relationships between efficiency, recall, and precision, it is important to note what factors are varying. Three variable factors which may affect effectiveness and efficiency of searches in string index displays { 11} are: 1. features of the index display; 2. the searcher and the search; and 3. the stage of the search.

As index display features vary, recall should, all things being equal, increase as efficiency increases. First, increased efficiency allows searchers to devote more of their available time and effort to selecting useful information rather than rejecting useless information. Second, by providing more frequent psychological reinforcement, it encourages searchers to devote more time and effort to continued searching.

By contrast, as the search stage varies, an inverse relationship between recall and efficiency is to be expected. Recall should increase as more and more of the useful items are found; at the same time, as less and less likely paths are pursued and indexed items which are already known are encountered more and more, the efficiency of the search should decrease. This tradeoff will, however, be observed only in certain stages of a search; namely, those in which searchers have sufficiently focused on what they want and how they should best proceed to get it.

As the searcher and the search vary, recall and precision may tend to vary inversely, as they do in traditional information retrieval tests. The kinds of assumptions which searchers make about information will vary with the aims of their searches: a searcher interested in "everything" on a topic is likely to make much more "optimistic" assumptions related to possible relevance of an indexed item; one looking for "one good article" is inclined to be more "pessimistic". The more "optimistic" the searchers' assumptions are, the lower the precision and the higher the recall will be; the more "pessimistic" the assumptions, the lower the recall and the higher the precision. The more detailed the index strings are, however, the less this classic recall-precision tradeoff is likely to be apparent: opportunities for either "optimistic" or "pessimistic" assumptions will be too few.

As the search stage varies, an inverse recall-precision relationship should not be expected: even though recall should increase as a search progresses, no general reason requires precision to decrease. Whether precision tends to change during searches and in what direction will often depend on how searchers tend to modify their queries and their perceived needs in response to the information that they encounter. One can also imagine a searcher initially looking under an only marginally appropriate heading and later finding a much more useful heading. Under the initial heading, because what is available all seems marginal, the searcher may, in the hopes of retrieving at least some information, select indexed items most of which are useless. Upon finding the better heading, however, the searcher will not only retrieve many more of the relevant indexed items, but will likely also feel free to select only those which are most clearly relevant. The result is an increase in both precision and recall as the search progresses. { 12}


A number of string indexing systems use quite simple input strings with little or no coding. But others have complex and varied coding conventions, and their input strings may not be easy to understand. One of the functions of the coding is often to indicate connections between terms which appear at some distance in the input string, connections which will be reflected in one or more of the index strings. As a result, even though the input strings look like simple linear structures, they may in fact represent tangled networks of terms and connections between terms. For this reason, I have devised two-dimensional network diagrams for representing the information given in input strings. This section will first explain the basic conventions used by these diagrams and then give an indication of their application to understanding string indexing.

1.6.1 Conventions

Terms will be represented in upper case; for example, the terms used to describe an indexed item on "the effects of pollution of rivers on the migration of salmon" are represented as:



An access term will be marked with an asterisk ("*"); thus, if "effects" is assumed to be too general and abstract an idea to be looked up in the index display but the remaining terms are considered suitable access terms, the sample terms are marked as:




A link, or direct connection between two terms, will stand for a relationship between the things that the two terms in a description represent. A link will be represented by a broken line. Since there will be different types of link between terms, labels in lower case will usually be added; e.g.,

indicates an "of" type of link between "EFFECTS" and "POLLUTION". The link will usually be seen as going from left to right or from top to bottom; the second term, as qualifying, or limiting the scope of, the first; and the lower-case label, as indicating in what respect the second term limits the { 13} scope of the first. A link forward from the first term to the second term in order to restrict the scope of the first term can be said to fulfil the qualification function. In practice, the second term will not necessarily qualify the first or the qualification may in fact be reciprocal.

The overall subject "effects of river pollution on salmon's migration" is represented as

Here, as elsewhere, no particular significance should be attached to which links are horizontal and which vertical; use of both horizontal and vertical links simply allows clearer diagrams.

Identical terms in different input strings do not usually represent the same thing or class of things, even when the terms are not explicitly qualified. For example, the bees in an item on "crop pollination by bees" are not necessarily the same bees as in an item on "experimental studies of perception of colors by bees". This sort of lack of distinction will be carried over into the diagrams. Thus, the same label "BEES" is used indiscriminately in constructing diagrams for both topics:


One type of term usually does represent the same thing regardless of the description in which it appears; namely, a term which on its own means one { 14} particular entity, such as an individual author, document, building, or geographical location. For example, "AUSTRALIA" usually refers to the same geographical location.

In practice, the division of a description into terms and links is not always straightforward. Adjectives are the most notable problem. Some string indexing systems allow adjectives only as parts of multiword terms; e.g., "NONNUMERIC DATA", "PUBLIC LIBRARIES", "VOLATILE COMPOUNDS". Others treat them almost as separate terms for purposes of access and manipulation. Where useful in order to keep closer to the actual wording of the input and index strings, the network diagrams will also treat certain adjectives as terms.

1.6.2 Application

Two ways in which a network diagram can be used as an aid are: 1. in imagining how the index string generator builds index strings; and 2. in conceptualizing the mental paths followed by searchers as they examine index strings.

The index string generator can be viewed as building up an index string by starting at an access term and then moving through the network diagram according to a set of rules. A step in the index string generator's traversal of the diagram would involve: examining a term in the structure; possibly making an appropriate addition to the index string; noting the term for future reference; then following a link to another term or returning to a term previously examined for further processing.

As an illustration, consider the production of an index string beginning with "POLLUTION" from the network diagram representing "effects of river pollution on salmon migration" above. The index string generator begins at the term "POLLUTION", which it places at the beginning of the index string:

It sees only one link forward from "POLLUTION", that to "RIVERS". The index string generator indicates that it is following this link forward by adding " of " to the index string:
"RIVERS" is now also added to the index string:
No further links proceed forward from either "RIVERS" or "POLLU{ 15}TION". Thus, the link from "EFFECTS" to "POLLUTION" is now followed backward, with a period being inserted to indicate this:
The forward link from "EFFECTS" to "MIGRATION" then gives
Finally, the link from "MIGRATION" to "SALMON" produces
All the terms in the network have now been visited, and the index string is complete.

A full set of index strings generated in this way is:

Here, as throughout this book, the index strings produced by an index string generator are presented in alphabetical order rather than the order in which they are, in fact or in theory, generated.

Note how a link followed forward in producing one index string is often followed backward in producing another. For example, when the access-point term is "RIVERS" the same links are followed in the opposite direction from that when the access term is "SALMON":


               -->               -->
^    |
|    on
^    |
|    of
vs. { 16}

               <--               <--
|    |
v    on
|    |
v    of

When one of the network diagrams is used as an aid in conceptualizing the search process, a term can be viewed as representing an idea which ought to occur to the searcher. Likewise, a sequence of terms joined by links, usually backward, from one to the next, can be viewed as a path of ideas that the searcher should follow in examining an index string. Each step along the path brings the searcher closer to the idea of the indexed item; and at each step the searcher may turn back, by ceasing to read the index string, and try another path if the sequence of ideas does not appear to be leading in a useful direction.

The index string for "SALMON" above will illustrate. The sequence of linked terms is: "SALMON" --> "MIGRATION" --> "EFFECTS". The corresponding path of ideas in the searcher's mind should be from "salmon" through "migration of these salmon" to "effects of river pollution on this salmon migration".  The searcher reading the index string is thus led from a fairly general idea to the quite specific topic of the indexed item.

Chapter 1 Summary

This book is concerned with the technique for helping people to find information known as string indexing. To understand string indexing, one must first understand what indexes and index displays are and of what parts they consist. A string index is characterized by multiple overlapping index entries for an indexed item and by computer generation of the description parts of the entries according to explicit syntactic rules.

Although the applicability of string indexing to online searching is not yet proved, the user illusion concept is suggestive of what that applicability might be.

From the searcher's viewpoint, the main purposes of the two essential characteristics of string indexes are to increase the effectiveness and efficiency { 17} of searches. Qualities of index strings which should contribute to these purposes are predictability, collocation, clarity, succinctness, and eliminability.

Some background ideas on the search process may be useful. Someone searching an index display will do so in different modes, and different qualities and parts of index strings are more suited to different modes of search. Search efficiency is decreased by redundant and irrelevant information and by index bulk. In considering the relationship between efficiency, recall, and precision, it is important to note what else is varying: e.g., features of the index display, the searcher and the search, or the stage of the search.

To represent the information in string indexing input strings, this book will occasionally employ network diagrams. Conventions of these diagrams include: terms in upper case; asterisks before access terms; and dashed lines, labeled in lower case, for links between terms. The diagrams may help in imagining both the process of index string generation and the mental paths followed by searchers of string indexes.

<-- Preface Contents Chapter 2: Survey of String Indexing Systems and Their Relatives -->