CHAPTER 2
SURVEY OF STRING INDEXING SYSTEMS AND THEIR RELATIVES

To understand the examples given in later chapters, it will be helpful to have at least some acquaintance with the specific string indexing systems cited. Thus, this chapter will survey a variety of string indexing systems and give examples of their basic input and output. Because of the number of different systems and the complexity of some of them, no more than an overview can be given. Those interested in full details should consult the documentation noted in the References.

String indexing systems will be divided roughly into three categories according to type of input string: phrases in ordinary language; simple lists of terms; and strings containing additional codes as instructions to the software. This division has been possible because, though string index generation without input strings can be envisioned, all present-day string indexing systems do have input strings.

A final section will give some background on other indexing methods which do not quite fit the definition of string indexing used in this book. These close relatives of string indexing are included here for purposes of comparison and because of their influence on the design of systems which do fit the definition.

2.1 STRING INDEXING WITH ORDINARY-LANGUAGE INPUT STRINGS

The simplest and most common kind of string indexing software is designed to use expressions in ordinary language as input strings. These expressions may be unmodified titles of documents, descriptions composed by an indexer, or hybrids of the two. A number of the systems outlined in this section do have variants with some coding conventions, but their basic forms employ uncoded input strings.

The software of these ordinary-language systems analyzes input strings, more or less crudely, into their components. It also classifies the components, again only approximately, as connectives, access terms, and so on. In doing so, it may refer to characteristics of the components, such as their length, or to various types of ancillary input, such as a stoplist or a golist. A stoplist is a list of terms which cannot be access terms; a golist is a list of terms which should be access terms. The access terms recognized are usually individual words, and these words are referred to as keywords.

2.1.1 Cycling and KWIC

One version of KWIC (KeyWord In Context) (Luhn 1960) is perhaps the earliest string indexing system.  A variant on the basic KWIC index generation process is, however, slightly easier to explain and so will be dealt with first. This variant is best called cycling.
     Given the sample input string
EXTRUSION TEXTURIZING OF SOY MEAL
a typical cycled index generator produces the four index strings:
  1. EXTRUSION TEXTURIZING OF SOY MEAL /
  2. MEAL / EXTRUSION TEXTURIZING OF SOY
  3. SOY MEAL / EXTRUSION TEXTURIZING OF
  4. TEXTURIZING OF SOY MEAL / EXTRUSION

The syntactic rules that define what index strings the cycled index string generator produces can be stated very briefly. A cycled index string consists of a keyword from the input string, the part of the input string following the keyword, a dividing symbol, and the part of the input string preceding the keyword. If no part of the input string happens to precede the keyword, as in index string 1 above, or to follow the keyword, as in 2, the index string will of course look somewhat simpler, since one part of it will be empty.
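
The cycling rule just stated can be sketched in a few lines of Python. This is only an illustration: the function name `cycle` and the one-word stoplist are assumptions made for the example, not part of any system described here.

```python
STOPLIST = {"OF"}  # assumed stoplist for this sketch


def cycle(input_string, stoplist=STOPLIST, divider="/"):
    """Return one cycled index string per keyword, sorted alphabetically."""
    words = input_string.split()
    index_strings = []
    for i, word in enumerate(words):
        if word in stoplist:
            continue  # stoplisted words never lead an index string
        following = words[i:]   # the keyword plus the part that follows it
        preceding = words[:i]   # the part that precedes the keyword
        index_strings.append(" ".join(following + [divider] + preceding))
    return sorted(index_strings)


for index_string in cycle("EXTRUSION TEXTURIZING OF SOY MEAL"):
    print(index_string)
```

Run on the sample input string, the sketch prints the four index strings listed above.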

In classic KWIC, the basic output of the cycling process is displayed so that the part of the title preceding the keyword continues to precede the keyword in the displayed index entry. In the simplest form, parts of the index string which would not fit onto a single line are omitted. Thus, the sample index strings might appear as:

  1.                  EXTRUSION TEXTURIZING OF
  2.  RIZING OF SOY   MEAL
  3.  EXTURIZING OF   SOY MEAL
  4.      EXTRUSION   TEXTURIZING OF SOY MEAL
Here, "SOY MEAL" is missing from index string 1 and "EXTRUSION" and parts of "TEXTURIZING" from 2 and 3 because the index strings are truncated to fit into the space available.
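
A minimal Python sketch of this simplest form of KWIC display follows. The column widths (13 characters before the keyword, 24 from the keyword onward) are assumptions chosen to reproduce the truncation in the sample display; the function name and stoplist are likewise hypothetical.

```python
import re


def kwic_lines(title, stoplist=frozenset({"OF"}), left_width=13, right_width=24):
    """Format one classic KWIC line per keyword, truncating to fixed widths."""
    lines = []
    for match in re.finditer(r"\S+", title):
        keyword = match.group()
        if keyword in stoplist:
            continue
        # Context before the keyword keeps only its last left_width characters.
        left = title[:match.start()].rstrip()[-left_width:]
        # The keyword and what follows keep only the first right_width characters.
        right = title[match.start():match.start() + right_width].rstrip()
        lines.append((keyword, left.rjust(left_width) + "   " + right))
    # Entries are arranged alphabetically by keyword.
    return [line for _, line in sorted(lines)]


for line in kwic_lines("EXTRUSION TEXTURIZING OF SOY MEAL"):
    print(line)
```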

Truncation is not usually a severe problem in practical KWIC index displays, where the space allowed for an index string is normally at least 60 characters, and often over 100. Allowing too much space for KWIC index string display, however, creates a problem of its own: too much empty, or "white", space.

The first disadvantage of excessive white space is of course a bulkier index. A second problem, however, is created because the locators in a KWIC index generally appear in a separate column on the righthand side of the display. If there is too much empty space between the ends of the index strings and the corresponding locators, searchers may find it quite difficult to match the correct locator with an index string (Fischer 1966).

One response to the white-space and truncation problems has been to "wrap around" the KWIC index strings. While there are different versions of wrap-around KWIC, a typical effect, provided the input string is not too long, is of cycled index strings with their last parts chopped off and reattached in front; e.g.,

  1. OF SOY MEAL /   EXTRUSION TEXTURIZING
  2. RIZING OF SOY   MEAL / EXTRUSION TEXTU
  3. EXTURIZING OF   SOY MEAL / EXTRUSION T
  4. L / EXTRUSION   TEXTURIZING OF SOY MEA
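
One way to picture the wrap-around effect is to cut each cycled index string at a fixed width and move the overflow in front of the keyword. The sketch below does this under assumed widths (22 characters from the keyword onward, 13 before it) that happen to reproduce the sample display; actual wrap-around KWIC programs differ in their details.

```python
def wraparound_kwic(title, stoplist=frozenset({"OF"}), left_width=13, right_width=22):
    """Cut each cycled string at right_width and reattach the rest in front."""
    words = title.split()
    lines = []
    for i, word in enumerate(words):
        if word in stoplist:
            continue
        cycled = " ".join(words[i:] + ["/"] + words[:i])
        right = cycled[:right_width].rstrip()
        # Whatever does not fit after the keyword wraps around to the left;
        # anything beyond left_width further characters is lost.
        left = cycled[right_width:right_width + left_width].strip()
        lines.append((right, left.rjust(left_width) + "   " + right))
    # Arrange by the text that starts at the keyword.
    return [line for _, line in sorted(lines)]


for line in wraparound_kwic("EXTRUSION TEXTURIZING OF SOY MEAL"):
    print(line)
```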

Another solution to the locator-matching problem, the filling out of the empty spaces with dots, does not reduce the bulk of the index (Chernyi and others 1969). More recent suggested variants allow multi-line index strings, as in the KWIC-style experimental library catalog tested at the Bath University Centre for Catalogue Research (Prowse 1983, pp. 7-9).

2.1.2 KWOC

Like KWIC, KWOC (Keyword Out of Context) includes a number of related, but slightly varied, systems. In the basic KWOC system, an index string consists of a keyword plus the unmodified input string; e.g.,
  1. EXTRUSION
      EXTRUSION TEXTURIZING OF SOY MEAL
  2. MEAL
      EXTRUSION TEXTURIZING OF SOY MEAL
  3. SOY
      EXTRUSION TEXTURIZING OF SOY MEAL
  4. TEXTURIZING
      EXTRUSION TEXTURIZING OF SOY MEAL

KWOC entries may be formatted to place the initial terms on separate lines from the rest, as above, or in a separate column.

In a variation of KWOC, a mark of omission (for example, an asterisk, "*") is substituted for the repeated access term:

  1. EXTRUSION
      * TEXTURIZING OF SOY MEAL
  2. MEAL
      EXTRUSION TEXTURIZING OF SOY *
  3. SOY
      EXTRUSION TEXTURIZING OF * MEAL
  4. TEXTURIZING
      EXTRUSION * OF SOY MEAL
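
Both the basic KWOC pattern and the variation with a mark of omission can be sketched as follows. The function name `kwoc`, the stoplist, and the representation of an entry as a (lead term, subheading) pair are assumptions made for illustration.

```python
def kwoc(title, stoplist=frozenset({"OF"}), omission_mark=None):
    """Return (lead term, subheading) pairs, sorted by lead term."""
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        if word in stoplist:
            continue
        if omission_mark is None:
            subheading = title  # basic KWOC: the unmodified input string
        else:
            # Variation: substitute the mark for the repeated access term.
            subheading = " ".join(words[:i] + [omission_mark] + words[i + 1:])
        entries.append((word, subheading))
    return sorted(entries)


for lead, subheading in kwoc("EXTRUSION TEXTURIZING OF SOY MEAL", omission_mark="*"):
    print(lead)
    print("    " + subheading)
```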

Truncation is unusual in KWOC systems, and the index string is often displayed over several lines. One exception is a system used by the MITRE Corporation, which truncates both longer lead terms and titles (Feinberg 1973, pp. 119-122).

2.1.3 PANDEX

PANDEX (Lay 1973, pp. 36-38; Feinberg 1973, pp. 144-147) is a development of KWOC with two notable features. The first feature is simply that the software sometimes changes the lead term to a standard form to provide better collocation. The second feature is that index strings with the same lead term are subarranged by the keyword that is in closest proximity to the lead term in the input string; if two keywords are equally close, preference is given to the later one. The repeated lead term and the subarranging keyword are emphasized by capitalization or boldface. Thus, on the one hand, the index strings for the sample title are displayed as:
  1. EXTRUSION
       EXTRUSION TEXTURIZING of soy meal
  2. MEAL
       Extrusion texturizing of SOY MEAL
  3. SOY
       Extrusion texturizing of SOY MEAL
  4. TEXTURIZING
       EXTRUSION TEXTURIZING of soy meal
On the other hand, they are sorted as though they were:
  1. EXTRUSION / TEXTURIZING
  2. MEAL / SOY
  3. SOY / MEAL
  4. TEXTURIZING / EXTRUSION
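
The choice of subarranging keyword can be sketched as below. The sketch assumes that proximity is counted in words, stoplisted words included, and that the stoplist holds only "OF"; neither detail is confirmed by the sources cited, and the function name is hypothetical.

```python
def pandex_sort_keys(title, stoplist=frozenset({"OF"})):
    """Pair each lead term with its nearest other keyword (ties go to the later)."""
    words = title.split()
    positions = [i for i, word in enumerate(words) if word not in stoplist]
    keys = []
    for i in positions:
        others = [j for j in positions if j != i]
        if not others:
            keys.append((words[i], ""))  # a lone keyword has nothing to subarrange by
            continue
        # Smallest word distance wins; on a tie, the later position wins.
        nearest = min(others, key=lambda j: (abs(j - i), -j))
        keys.append((words[i], words[nearest]))
    return sorted(keys)


for lead, sub in pandex_sort_keys("EXTRUSION TEXTURIZING OF SOY MEAL"):
    print(lead, "/", sub)
```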

The results of the two features can be seen in the following short extract from a PANDEX index (Feinberg 1973, p. 145), which also truncates the subheading part of each index string to fit a standard column size:

ADSORPTION
of polyglutamic acid adsorbed on char
changes induced by alkali adsorption
methods. IV. Adsorption of aromatic 
Adsorption characteristics of water so-
Chromatographic adsorption constants
in adsorption chromatography on alum-

2.1.4 PERMUTERM

PERMUTERM (Garfield 1976) is a specific system developed by the Institute for Scientific Information (ISI) for indexing Science Citation Index and others of its publications. Each index string in a PERMUTERM index consists basically of a term from the input string followed by another term from the same input string. The first term will, however, be truncated after 18 characters and the second after 11 characters (Feinberg 1973, pp. 140-144). Moreover, not every possible pair of terms is used: not only does a stoplist exclude certain terms from the first position; certain terms are also excluded from the second position and some further term pairs are disallowed. PERMUTERM index strings for the sample title might be:
  1. EXTRUSION
       MEAL
  2. EXTRUSION
       SOY
  3. EXTRUSION
       TEXTURIZING
  4. MEAL
       EXTRUSION
  5. MEAL
       SOY
  6. MEAL
       TEXTURIZING
  7. SOY
       EXTRUSION
  8. SOY
       MEAL
  9. SOY
       TEXTURIZING
  10. TEXTURIZING
       EXTRUSION
  11. TEXTURIZING
       MEAL
  12. TEXTURIZING
       SOY
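
The basic pairing can be sketched with the standard `itertools.permutations` function. The sketch applies the 18- and 11-character truncation mentioned above but ignores the further exclusions of second-position terms and disallowed pairs; the function name and stoplist are assumptions.

```python
from itertools import permutations


def permuterm_pairs(title, stoplist=frozenset({"OF"}), first_width=18, second_width=11):
    """Return every ordered pair of distinct terms, truncated to the column widths."""
    terms = sorted({word for word in title.split() if word not in stoplist})
    # permutations() of a sorted list yields the pairs in alphabetical order.
    return [(a[:first_width], b[:second_width]) for a, b in permutations(terms, 2)]


for first, second in permuterm_pairs("EXTRUSION TEXTURIZING OF SOY MEAL"):
    print(first)
    print("     " + second)
```

Four terms give 4 x 3 = 12 pairs, which matches the twelve index strings shown.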

2.1.5 Double-KWIC

Similar production of more than one index string beginning with the same keyword from the same input string is seen in the Double-KWIC technique (Petrarcha and Lay 1969a, 1969b; Lay 1973; Lay and Petrarcha 1970). In spite of its name, Double-KWIC is actually a combination of cycling with modified KWOC. Index strings are generated in two stages: first, an intermediate set of strings is derived from each input string by cycling; second, a modified KWOC procedure is applied to each intermediate string. Access terms are recognized from a golist supplied by the index producer; the following example assumes that the words "EXTRUSION", "MEAL", "SOY", and "TEXTURIZING" are all on the Double-KWIC software's golist, while phrases such as "EXTRUSION TEXTURIZING" and "TEXTURIZING OF SOY" are not:
  1. EXTRUSION
       MEAL / * TEXTURIZING OF SOY
  2. EXTRUSION
       SOY MEAL / * TEXTURIZING OF
  3. EXTRUSION
       TEXTURIZING OF SOY MEAL / *
  4. MEAL
       EXTRUSION TEXTURIZING OF SOY *
  5. MEAL
       SOY * / EXTRUSION TEXTURIZING OF
  6. MEAL
       TEXTURIZING OF SOY * / EXTRUSION
  7. SOY
       EXTRUSION TEXTURIZING OF * MEAL
  8. SOY
       MEAL / EXTRUSION TEXTURIZING OF *
  9. SOY
       TEXTURIZING OF * MEAL / EXTRUSION
  10. TEXTURIZING
       EXTRUSION * OF SOY MEAL
  11. TEXTURIZING
       MEAL / EXTRUSION * OF SOY
  12. TEXTURIZING
       SOY MEAL / EXTRUSION * OF
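
The two stages can be sketched as follows. The sketch assumes single-word golist terms, and it drops a dividing symbol left at the very end of a subheading, as the sample output does; the function name is hypothetical.

```python
def double_kwic(title, golist):
    """Cycle the title, then apply KWOC with an omission mark to each result."""
    words = title.split()
    # Stage 1: one intermediate cycled string per golist term.
    intermediates = [
        words[i:] + ["/"] + words[:i]
        for i, word in enumerate(words)
        if word in golist
    ]
    # Stage 2: every other golist term in an intermediate string becomes a lead.
    entries = []
    for cycled in intermediates:
        for lead in cycled[1:]:
            if lead not in golist:
                continue
            subheading = ["*" if word == lead else word for word in cycled]
            if subheading[-1] == "/":
                subheading = subheading[:-1]  # no trailing divider
            entries.append((lead, " ".join(subheading)))
    return sorted(entries)


GOLIST = {"EXTRUSION", "MEAL", "SOY", "TEXTURIZING"}
for lead, subheading in double_kwic("EXTRUSION TEXTURIZING OF SOY MEAL", GOLIST):
    print(lead)
    print("     " + subheading)
```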

A later modification (Lay 1973) allows the index producer to specify a mixture of KWOC and Double-KWIC. Double-KWIC is followed for access terms which apply to more than a given number of indexed items, while for rarer lead terms KWOC is adopted.

2.1.6 Articulated Subject Index

A more complex procedure for generating index strings from titles or title-like phrases is that of the Articulated Subject Index (ASI), developed by Lynch and others (Armitage and Lynch 1967, 1968; Lynch 1969; Lynch and Petrie 1973). An ASI index would have index strings such as:
  1. EXTRUSION TEXTURIZING, OF SOY MEAL
  2. MEAL, SOY, EXTRUSION TEXTURIZING OF
  3. SOY MEAL, EXTRUSION TEXTURIZING OF
  4. TEXTURIZING, EXTRUSION, OF SOY MEAL
As in cycling, KWIC, and KWOC, each index string normally contains all the terms of the input string. The ASI index string generation process basically represents an elaboration of cycling. Instead of dealing immediately with the whole input string, however, the ASI index string generator first cycles a small segment of the input string and then moves on to deal with the rest of progressively larger segments that contain the first.

Relatively simple input strings can be segmented in ASI in only one way; for example, "EXTRUSION TEXTURIZING OF SOY MEAL" can be segmented only as follows:

EXTRUSION TEXTURIZING   OF   SOY MEAL
Such input strings thus always yield the same index strings. For more complex input strings, the ASI program must choose which of several "variant" index strings to produce for a given access term; the basis for the choice is how many index strings beginning in the same way can be produced for other indexed items. The aim is to improve collocation.

2.1.7 KWPSI

Perhaps the most sophisticated string indexing system designed to accept uncoded ordinary-language input strings is KWPSI (Key Word/Phrase Subject Index) (Vladutz and Garfield 1979). The KWPSI software analyzes an input string in a rough form of linguistic parsing. Short lists of prepositions, conjunctions, auxiliary verbs, articles, pronouns, and some other words are used, plus a list of non-access terms from ISI's PERMUTERM software. Each word not on one of the lists of non-access terms becomes an access term. KWPSI index strings are similar to those for ASI. With the aim of decreasing index bulk, however, generation of an index string may be halted at certain well defined points before all the input string terms have been included. For example, possible index strings for the title "A FUNCTION GENERATOR USING INTEGRATED CIRCUITS" using the basic version of KWPSI are:
  1. INTEGRATED / CIRCUITS
  2. CIRCUITS / INTEGRATED *
  3. GENERATOR / A FUNCTION * USING INTEGRATED CIRCUITS

2.2 STRING INDEXING WITH TERM-LIST INPUT STRINGS

A few string indexing systems are designed for input strings consisting of unconnected terms. Simple lists of keywords can, of course, be used quite successfully as input strings in KWIC and KWOC systems, PERMUTERM, and Double-KWIC; and some index string generators for term-list input strings are quite similar to those for ordinary-language input strings. Some features, however, set them apart.

Take, for example, the subject indexes to CLASE (Citas Latinoamericanas en Sociología, Economía y Humanidades). These indexes are produced by a PERMUTERM-like index string generator from lists of subject headings assigned to each indexed item. Because, however, the subject headings are quite often two or three words long, more space is allowed for them in the index strings than is allowed for terms in ISI's PERMUTERM; thus, truncation is avoided. For instance, an article entitled "La Administración de las Provincias Senatoriales Romanas", assigned the term list

  ADMINISTRACION PUBLICA, PROVINCIAS, ROMA, SENADO

has the index strings:
  1. ADMINISTRACION PUBLICA
       PROVINCIAS
  2. ADMINISTRACION PUBLICA
       ROMA
  3. ADMINISTRACION PUBLICA
       SENADO
  4. PROVINCIAS
       ADMINISTRACION PUBLICA
  5. PROVINCIAS
       ROMA
  6. PROVINCIAS
       SENADO
  7. ROMA
       ADMINISTRACION PUBLICA
  8. ROMA
       PROVINCIAS
  9. ROMA
       SENADO
  10. SENADO
       ADMINISTRACION PUBLICA
  11. SENADO
       PROVINCIAS
  12. SENADO
       ROMA

A list of terms can be given a simple, easily predicted order by being alphabetized. The ABC-Spindex system (Falk and Baser 1980), by which ABC-Clio indexes America: History and Life, uses such a list of alphabetized keywords, applying a modified KWOC procedure to produce the index strings. Because the terms are alphabetized and unconnected, no symbol of omission is required. For example, an indexed item dealing with "Thomas Allen and the American Revolution in New England from 1774 to 1777" is assigned the term list

  Allen, Thomas. American Revolution. New England. 1774-77.

As a result, it has the index strings:
  1. Allen, Thomas. American Revolution. New England. 1774-77.
  2. American Revolution. Allen, Thomas. New England. 1774-77.
  3. New England. Allen, Thomas. American Revolution. 1774-77.
Note that date terms, such as "1774-77", are not access terms.
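
A minimal sketch of this modified KWOC procedure is given below. The test for a date term (a term beginning with a digit) is an assumption made for the example, as is the function name; the sketch also expects the term list to be alphabetized already.

```python
def abc_spindex(term_list):
    """Lead with each access term in turn; the other terms keep their order."""
    index_strings = []
    for lead in term_list:
        if lead[0].isdigit():
            continue  # assumed test: date terms are not access terms
        rest = [term for term in term_list if term != lead]
        index_strings.append(". ".join([lead] + rest) + ".")
    return index_strings


terms = ["Allen, Thomas", "American Revolution", "New England", "1774-77"]
for index_string in abc_spindex(terms):
    print(index_string)
```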

Two systems, TABLEDEX (Ledley 1958) and SLIC (Sharp 1966), retain strict alphabetical order in every index string by limiting the terms which follow the lead term to ones which file after it.

The TABLEDEX index string generator produces one index string for each access term in the input string, omitting only the terms which precede the access term. Thus, an article assigned the term list

  analysis, application, differentiation, gas, netherlands, nuclear, theory

has the index strings:
  1. analysis: application: differentiation: gas: netherlands: nuclear: theory
  2. application: differentiation: gas: netherlands: nuclear: theory
  3. differentiation: gas: netherlands: nuclear: theory
  4. gas: netherlands: nuclear: theory
  5. netherlands: nuclear: theory
  6. nuclear: theory
  7. theory
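
In programming terms, the TABLEDEX rule amounts to listing the suffixes of the alphabetized term list. A brief sketch, with a hypothetical function name:

```python
def tabledex(term_list, separator=": "):
    """One index string per term: the term plus every term that files after it."""
    terms = sorted(term_list)
    return [separator.join(terms[i:]) for i in range(len(terms))]


terms = ["analysis", "application", "differentiation", "gas",
         "netherlands", "nuclear", "theory"]
for index_string in tabledex(terms):
    print(index_string)
```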

The SLIC (Selected Listing In Combination) string index generator, by contrast, like PERMUTERM and Double-KWIC, routinely produces more than one index string for a single access term in the input string. For example, the index strings produced for the term list

  EFFECTIVENESS, INDEXING, RETRIEVAL, THEORY

are:
  1. EFFECTIVENESS : INDEXING : RETRIEVAL : THEORY
  2. EFFECTIVENESS : INDEXING : THEORY
  3. EFFECTIVENESS : RETRIEVAL : THEORY
  4. EFFECTIVENESS : THEORY
  5. INDEXING : RETRIEVAL : THEORY
  6. INDEXING : THEORY
  7. RETRIEVAL : THEORY
  8. THEORY
The index strings are generated very simply by taking every possible combination of input string terms which includes the last term in the input string. Note how "THEORY" appears in every index string in the list above.
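
This rule maps directly onto the standard `itertools.combinations` function, as the following sketch suggests. The function name is hypothetical, and a non-empty term list is assumed.

```python
from itertools import combinations


def slic(term_list, separator=" : "):
    """Every combination of terms, in order, that ends with the last term."""
    *others, last = sorted(term_list)
    combos = []
    for size in range(len(others) + 1):
        for chosen in combinations(others, size):
            combos.append(chosen + (last,))
    # Sorting the tuples gives the alphabetical arrangement of the index.
    return [separator.join(combo) for combo in sorted(combos)]


for index_string in slic(["EFFECTIVENESS", "INDEXING", "RETRIEVAL", "THEORY"]):
    print(index_string)
```

A list of n terms thus yields 2 to the power n-1 index strings: eight for the four-term sample.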

MULTITERM (Skolnik 1970, 1972) is also discussed in this section, even though it is somewhat difficult to view the elements of its input strings as consisting just of terms. Each input string element does consist mostly of a term of one or more words; but, in addition, this term may have appended to it a space-plus-hyphen (" -") followed by a one- or two-letter role code. In general, a role code is a code indicating how a term is related to the rest of a description. Examples in MULTITERM are: "Q", indicating that the term names a "property" or "quality" of something mentioned earlier in the input string; "D", for "determination"; and "U", for "use". Other string indexing systems employ role codes which are more clearly separate from the terms and which help give instructions to the software.

MULTITERM terms, with their role codes where appropriate, are separated by slashes ("/"); a double slash ("//") marks the end of the input string. For instance, the MULTITERM input string for a document on "the study of the structure of graphite fibers using an X-ray method" is

Fiber:Graphite -Q/Structure -D/Test Method -U/X-Ray -U//
The MULTITERM software produces the index strings by simply cycling the elements marked off by the slashes:
  1. Fiber:Graphite -Q/Structure -D/Test Method -U/X-Ray -U//
  2. Structure -D/Test Method -U/X-Ray -U//Fiber:Graphite -Q
  3. Test Method -U/X-Ray -U//Fiber:Graphite -Q/Structure -D
  4. X-Ray -U//Fiber:Graphite -Q/Structure -D/Test Method -U
Note here the absence of any index string beginning with "Graphite"; the reason is that "Graphite" is not a term in itself but only part of the term "Fiber:Graphite".
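
The cycling of slash-delimited elements can be sketched as below; the function name is hypothetical, and the placement of the double slash simply follows the sample output.

```python
def multiterm(input_string):
    """Cycle the slash-delimited elements; '//' keeps marking the original end."""
    elements = input_string.rstrip("/").split("/")
    return [
        "/".join(elements[i:]) + "//" + "/".join(elements[:i])
        for i in range(len(elements))
    ]


sample = "Fiber:Graphite -Q/Structure -D/Test Method -U/X-Ray -U//"
for index_string in multiterm(sample):
    print(index_string)
```

Because the input string is split only at slashes, "Fiber:Graphite" stays together as a single element and "Graphite" never comes to the front.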

2.3 STRING INDEXING WITH CODED INPUT STRINGS

Index system designers early saw that codes could be added to title-like phrases or lists of terms to increase the indexer's control over the output. For example, Dowell and Marshall (Dowell and Marshall 1962) describe a KWOC-like procedure applied to lists of terms in which the indexer can mark a term to appear only in the subheading or only at the beginning of an index string. Similarly, East and others (East and others 1963) report producing a KWIC index from titles in which access terms are marked by a special character.

2.3.1 Statement Indexing

A proposal by Yeats for a system called Statement Indexing (Yeats 1964) shows a further move in the direction of more formalized coded input. Statement Indexing imposes a logical sequence on the parts of input strings, mostly on the basis of 15 "syntactic categories", including "substance", "aspect", "process", "agent", and "locus". As in a number of more sophisticated string indexing systems, the indexer in Statement Indexing employs codes to divide not only individual terms, but also larger segments of the input string. The indexer indicates divisions between larger segments by dashes (" - ") or double periods (".."); slashes ("/") and commas mark divisions between smaller segments. A dividing symbol is often followed by a preposition or abbreviation indicating the type of term introduced; for example, "- by" introduces an "agent" term, "- in" introduces a "locus" term, and "/f." means "from". A Statement Indexing input string for a document on "effects of hay from alfalfa on the contents of the rumen of cattle" is
Rumen .. contents .. affected - by hay /f. alfalfa - in cattle

Given that each dash in the input string identifies an access term, index string generation in Statement Indexing is basically KWOC-like. Two major differences should be noted, however. These differences can be illustrated by the index strings corresponding to the sample input string above:

  1. Cattle: Rumen ..  contents .. affected - by hay /f. alfalfa - in cattle
  2. Hay/f.  alfalfa: Rumen .. contents .. affected - by hay /f. alfalfa - in cattle
  3. Rumen .. contents .. affected - by hay /f. alfalfa - in cattle
Index string 2 illustrates the first difference: it is not necessarily just one word or term that becomes the heading, but everything between the dash and the next major dividing symbol, excluding only the introductory connective. Index string 3 illustrates the second difference: the unmodified input string also becomes one of the index strings.
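
A rough sketch of this KWOC-like procedure follows. It assumes that the only introductory connectives are "by" and "in", and it leaves the spacing inside a heading as it stands in the input string, so the heading for index string 2 comes out as "Hay /f. alfalfa". The function name is hypothetical.

```python
import re

CONNECTIVES = {"by", "in"}  # assumed list of introductory connectives


def statement_index(input_string):
    """Return the unmodified input string plus one entry per dash-introduced term."""
    entries = [input_string]  # second difference: the input string itself
    segments = re.split(r"\s+-\s+", input_string)
    for segment in segments[1:]:
        # First difference: the heading runs to the next major dividing symbol.
        segment = re.split(r"\s*\.\.\s*", segment)[0]
        first, _, rest = segment.partition(" ")
        heading = rest if first in CONNECTIVES and rest else segment
        heading = heading[:1].upper() + heading[1:]
        entries.append(heading + ": " + input_string)
    return sorted(entries)


sample = "Rumen .. contents .. affected - by hay /f. alfalfa - in cattle"
for index_string in statement_index(sample):
    print(index_string)
```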

Access via terms introduced by slashes in the input string is provided by cross-references; e.g.,

Alfalfa - hay see Hay/f.  alfalfa

2.3.2 Automated Library Catalog Displays

A number of systems for producing library catalog displays and similar bibliographic tools satisfy the basic definition of string indexing systems with coded input strings. True, they may not be widely recognized as such, and a general book on string indexing cannot consider their many varieties.  But typical features can be outlined in brief.

Many library catalog systems apply some version of the MARC (MAchine-Readable Cataloging) standard to their input strings. The indexer, or cataloger, does not have to construct, or even see, the input string in its raw MARC format, which is more or less incomprehensible except to experts. Nevertheless, catalogers must often know and employ many of the MARC codes in their input.

Associated with most of the larger segments, or fields, of a MARC input string are three-digit codes identifying the aspect of the indexed item being described. Examples are: "100", "main entry", usually the name of the principal author; "245", title; "250", edition; "260", publication details; "300", physical description; "504", note on bibliographies and indexes contained; "650", "topical" subject.

Smaller segments, or subfields, may be marked with two-character codes such as "‡b", "‡c", and "‡2", whose meaning varies depending on where they occur (the first character, variously displayed in practice, is represented here by the double dagger, "‡"). Other codes appear in fixed positions near the beginning of the input string or the beginnings of fields. For example, "10" at the beginning of the "100" field signifies "single surname, not a subject"; "14" at the beginning of the "245" field, "access term, first 4 characters disregarded in sorting"; "0" at the beginning of the "260" field, "publisher or the like named"; " 0" at the beginning of a "650" field, "not specified as primary or secondary, Library of Congress subject heading".

Much of the information in the first part of a MARC input string is both heavily coded and not necessary for the production of index strings. Thus, the following sample display is only of information from the later part:

100 10  Tuchman, Barbara Wertheim.
245 14  The march of folly : ‡b from Troy to Vietnam / ‡c Barbara W. Tuchman
250     1st ed.
260 0   New York : ‡b Knopf : Distributed by Random House, ‡c 1984.
300     xiv, 447 p., [32] p. of plates : ‡b ill. (some col.) ; ‡c 24 cm.
504     Includes bibliographies and index.
650  0  History, Modern
650  0  History ‡x Errors, inventions, etc.
650  0  Power (Social sciences)
650  0  Judgment.
(OCLC 1984, p. Intro:2).

MARC is not in itself a string indexing system; it is not even a complete system for input strings. Catalogers also consult: 1. an extensive set of "descriptive" cataloging rules on how aspects of the indexed item other than its subjects are to be described; 2. a list of subject headings; 3. at least one library classification scheme. The second edition of the Anglo-American Cataloguing Rules is the common present-day standard for "descriptive" cataloging. For the other aspects of the description, which relate mainly to subject matter, the Library of Congress Subject Headings and either the Dewey Decimal Classification or the Library of Congress Classification are widely used.

After these extensive instructions for input strings, index string generation is relatively simple. The index string generation process is similar to that for Statement Indexing, which in turn was probably influenced by earlier manual library catalog practice. The main difference is in the MARC codes, which all drop out, with a small amount of formatting and numbering taking their place. For example, typical index strings for the sample input string partly displayed above are:

  1.     HISTORY - ERRORS, INVENTIONS, ETC.
    Tuchman, Barbara Wertheim.
       The march of folly : from Troy to Vietnam / Barbara W. Tuchman.  1st ed. New York : Knopf : Distributed by Random House, 1984.
       xiv, 447 p., [32] p. of plates : ill. (some col.) ; 24 cm.
       Includes bibliographies and index.

    1. History, Modern. 2. History - Errors, inventions, etc. 3. Power (Social Sciences). 4. Judgment. I. title.

  2.     HISTORY, MODERN
    Tuchman, Barbara Wertheim.
       The march of folly : from Troy to Vietnam / Barbara W. Tuchman.  1st ed. New York : Knopf : Distributed by Random House, 1984.
       xiv, 447 p., [32] p. of plates : ill. (some col.) ; 24 cm.
       Includes bibliographies and index.

    1. History, Modern. 2. History - Errors, inventions, etc. 3. Power (Social Sciences). 4. Judgment. I. title.

  3.     JUDGMENT
    Tuchman, Barbara Wertheim.
       The march of folly : from Troy to Vietnam / Barbara W. Tuchman.  1st ed. New York : Knopf : Distributed by Random House, 1984.
       xiv, 447 p., [32] p. of plates : ill. (some col.) ; 24 cm.
       Includes bibliographies and index.

    1. History, Modern. 2. History - Errors, inventions, etc. 3. Power (Social Sciences). 4. Judgment. I. title.

  4.     March of folly
    Tuchman, Barbara Wertheim.
       The march of folly : from Troy to Vietnam / Barbara W. Tuchman.  1st ed. New York : Knopf : Distributed by Random House, 1984.
       xiv, 447 p., [32] p. of plates : ill. (some col.) ; 24 cm.
       Includes bibliographies and index.

    1. History, Modern. 2. History - Errors, inventions, etc. 3. Power (Social Sciences). 4. Judgment. I. title.

  5.     POWER (SOCIAL SCIENCES)
    Tuchman, Barbara Wertheim.
       The march of folly : from Troy to Vietnam / Barbara W. Tuchman.  1st ed. New York : Knopf : Distributed by Random House, 1984.
       xiv, 447 p., [32] p. of plates : ill. (some col.) ; 24 cm.
       Includes bibliographies and index.

    1. History, Modern. 2. History - Errors, inventions, etc. 3. Power (Social Sciences). 4. Judgment. I. title.

  6.     Tuchman, Barbara Wertheim.
    Tuchman, Barbara Wertheim.
       The march of folly : from Troy to Vietnam / Barbara W. Tuchman.  1st ed. New York : Knopf : Distributed by Random House, 1984.
       xiv, 447 p., [32] p. of plates : ill. (some col.) ; 24 cm.
       Includes bibliographies and index.

    1. History, Modern. 2. History - Errors, inventions, etc. 3. Power (Social Sciences). 4. Judgment. I. title.
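
How the headings of these six index strings might be derived from the coded fields is sketched below. The sketch is a loose illustration, not a description of any actual catalog software: it handles only the 100, 245, and 650 fields, takes the second digit of the 245 code as the number of characters to disregard, and leaves final punctuation untouched (so it yields "JUDGMENT." with its period). All function names are hypothetical.

```python
import re

# Fields transcribed from the sample display: (tag, fixed-position codes, data).
FIELDS = [
    ("100", "10", "Tuchman, Barbara Wertheim."),
    ("245", "14", "The march of folly : ‡b from Troy to Vietnam / ‡c Barbara W. Tuchman"),
    ("650", " 0", "History, Modern"),
    ("650", " 0", "History ‡x Errors, inventions, etc."),
    ("650", " 0", "Power (Social sciences)"),
    ("650", " 0", "Judgment."),
]


def strip_subfield_codes(data, replacement=" "):
    """Drop each two-character subfield code, leaving the replacement text."""
    return re.sub(r"\s*‡\w\s*", replacement, data).strip()


def headings(fields):
    """Derive author, title, and subject headings, sorted ignoring case."""
    result = []
    for tag, codes, data in fields:
        if tag == "100":
            result.append(strip_subfield_codes(data))
        elif tag == "245":
            skip = int(codes[1])  # characters disregarded in sorting
            title = data.split("‡")[0].rstrip(" :/")[skip:]
            result.append(title[:1].upper() + title[1:])
        elif tag == "650":
            result.append(strip_subfield_codes(data, " - ").upper())
    return sorted(result, key=str.lower)


for heading in headings(FIELDS):
    print(heading)
```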

2.3.3 PRECIS

Perhaps the most recognized string indexing system emphasizing coding of input strings is PRECIS (PREserved Context Index System), developed for the British National Bibliography by Austin and others (Austin 1974a, 1974b; Austin and Dykstra 1984).  There are two major versions of PRECIS, making use of distinct coding conventions and somewhat different index string generation rules; the first version (Austin and Butcher 1969), however, is now considered obsolete and will not be referred to in this book.

Though not nearly so complex as MARC, the input strings in the PRECIS system still tend to seem rather daunting to the non-initiate because of the number of coding characters used. Each segment of a PRECIS input string is introduced by a nine-character code beginning with a dollar sign ("$") plus "x", "y", or "z"; a two- or three-character code also beginning with a dollar sign introduces a connective or a term after the first in a segment. The input string is easier to grasp if each segment is written on a separate line, with the positions in the nine-character codes lined up in columns. The third column contains role codes which help to define links between terms; for example, "1" for the object of an action, "2" for an action or process, "3" for agents or factors, "p" for parts and properties, or "s" for certain types of relationships. The fourth column contains a "1" if an access term follows; otherwise, a "0". A fairly simple example is

$z11030$adocument surrogates
$zp1030$ainformation content$wof
$z20030$ameasurement$wof
$zs0030$aapplications$vof$win
$z31030$ainformation theory
for an article on "applying information theory to measuring the information in document surrogates [such as abstracts]". This description can be viewed as corresponding to a structure somewhat like
APPLICATIONS---of (s-agent)---*INFORMATION THEORY
|
in (s-object)
|
MEASUREMENT
|
of (2-object)
|
*INFORMATION CONTENT---of (whole)---*DOCUMENT SURROGATES

A PRECIS index string has three basic parts. The first two, the "lead" and the "qualifier", together form the heading; the lead is in boldface and is separated from the qualifier by a period-plus-space. The third part is a subheading, called the "display". This general pattern may be represented as

Lead. Qualifier
     Display
The lead must contain at least one term, but the qualifier or the display may be empty.

The PRECIS index string generation rules are quite complex and cannot be given here in detail. The most fundamental procedure, however, is to: 1. make the access term into the lead; 2. put the terms which precede it in the input string into the qualifier in reverse order; and 3. put the terms following it in the input string into the "display" in their original order. The first two index strings produced from the sample input string show the results of this fundamental procedure, known as "shunting"; the third shows the result of a procedure known as the "predicate transformation":

  1. Document surrogates
         Information content. Measurement. Applications of information theory
  2. Information content. Document surrogates
         Measurement.  Applications of information theory
  3. Information theory
         Applications in measurement of information content of document surrogates
Note here how only 2. actually has a qualifier ("Document surrogates"), while 1. and 3. have only leads and displays.
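
The fundamental shunting procedure can be sketched as follows. The sketch covers only index strings 1 and 2; the predicate transformation behind index string 3 is not attempted. It assumes that a "$v" connective joins its term to the next term when the display is read in input-string order, which fits the sample output but is a simplification of the full PRECIS rules. The function names are hypothetical.

```python
import re

SAMPLE = [
    "$z11030$adocument surrogates",
    "$zp1030$ainformation content$wof",
    "$z20030$ameasurement$wof",
    "$zs0030$aapplications$vof$win",
    "$z31030$ainformation theory",
]


def parse_segment(segment):
    """Pick out the role code, the access-term flag, the term, and any $v connective."""
    role, flag, term, rest = re.match(r"\$z(.)(\d)\d{3}\$a([^$]+)(.*)", segment).groups()
    connectives = dict(re.findall(r"\$([vw])([^$]+)", rest))
    return {"role": role, "lead": flag == "1", "term": term, "v": connectives.get("v")}


def capitalize_first(text):
    return text[:1].upper() + text[1:]


def shunt(segments, lead_index):
    """Return (lead, qualifier, display) for the term at lead_index."""
    terms = [parse_segment(segment) for segment in segments]
    lead = capitalize_first(terms[lead_index]["term"])
    # Earlier terms go into the qualifier in reverse order.
    qualifier = ". ".join(capitalize_first(t["term"]) for t in reversed(terms[:lead_index]))
    # Later terms go into the display in their original order.
    parts, glue = [], None
    for t in terms[lead_index + 1:]:
        if glue is None:
            parts.append(capitalize_first(t["term"]))
        else:
            parts[-1] += " " + glue + " " + t["term"]  # assumed use of $v
        glue = t["v"]
    return lead, qualifier, ". ".join(parts)


print(shunt(SAMPLE, 0))
print(shunt(SAMPLE, 1))
```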

2.3.4 POPSI

The name POPSI (POstulate-based Permuted Subject Index) refers to a family of string indexing systems developed by Bhattacharyya and others (Bhattacharyya 1979) and based on the work of Ranganathan on the theory of classification. By comparison with PRECIS, the various forms of input string coding for POPSI are relatively simple. Recent theoretical discussions use numeric codes to mark segments, and punctuation such as dashes and periods to mark divisions within segments. The coding used for the most recent reported index string generator (Ravichandra Rao 1976), however, is based on the indicator system of Colon Classification. Here, a comma (",") precedes the "entity" segment; a semicolon (";"), a "property" segment; a colon (":"), a "process" segment; a hyphen ("-"), a qualifying subsegment; and a greater-than sign (">"), a narrower term. For example, a "study, using rabbits, of heart stimulation by antibiotics", assigned to the discipline of pharmacology, has the input string
PHARMACOLOGY, CHEMICAL>DRUG>ANTIBIOTIC; STIMULATION-CIRCULATORY SYSTEM>HEART: STUDY-ANIMAL>RABBIT
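
The way the indicator symbols divide such an input string can be sketched in Python as follows. This is only an illustration of the coding described above, not the reported POPSI software; the function name, the segment labels, and the label "discipline" for the text before the first indicator are assumptions, and the sketch assumes that none of the indicator characters occurs inside a term.

```python
import re

# Indicator symbols as described in the text; the labels are illustrative.
SEGMENT_KINDS = {",": "entity", ";": "property", ":": "process"}


def parse_popsi(input_string):
    """Split a POPSI-style input string into segments and subsegments."""
    parts = re.split(r"([,;:])", input_string)
    # parts[0] is the text before the first indicator (assumed: the discipline).
    segments = [("discipline", parts[0])]
    for indicator, text in zip(parts[1::2], parts[2::2]):
        segments.append((SEGMENT_KINDS[indicator], text))

    parsed = []
    for kind, text in segments:
        # "-" introduces a qualifying subsegment; ">" introduces a narrower term.
        subsegments = [
            [term.strip() for term in sub.split(">")] for sub in text.split("-")
        ]
        parsed.append((kind, subsegments))
    return parsed


sample = ("PHARMACOLOGY, CHEMICAL>DRUG>ANTIBIOTIC; "
          "STIMULATION-CIRCULATORY SYSTEM>HEART: STUDY-ANIMAL>RABBIT")
for kind, subsegments in parse_popsi(sample):
    print(kind, subsegments)
```

Each segment becomes a list of subsegments, and each subsegment a chain running from broader to narrower term.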

The rule for index string generation in the theoretical "basic" version of POPSI is an extremely simple KWOC-like one. Early implementations, however, use cycling (Bhattacharyya and Neelameghan 1969; Ravichandra Rao 1973). The most recent reported POPSI index string generator is fairly KWOC-like; but additional qualifying terms are inserted after the lead term, and generic terms are dropped in the subheading (Mahapatra 1978; Neelameghan and Gopinath 1975; Ravichandra Rao 1976). For example, index strings produced from the sample input string above are:

  1. ANIMAL,STUDY,STIMULATION
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT
  2. ANTIBIOTIC,PHARMACOLOGY
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT { 35}
  3. CHEMICAL,PHARMACOLOGY
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT
  4. CIRCULATORY SYSTEM,STIMULATION,ANTIBIOTIC
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT
  5. DRUG,PHARMACOLOGY
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT
  6. HEART,STIMULATION,ANTIBIOTIC
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT
  7. PHARMACOLOGY
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT
  8. RABBIT,STUDY,STIMULATION
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT
  9. STIMULATION,ANTIBIOTIC
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT
  10. STUDY,STIMULATION
         PHARMACOLOGY,ANTIBIOTIC;STIMULATION -HEART:STUDY-RABBIT
Note how a lead term like "ANIMAL" or "RABBIT" from a qualifying subsegment has two terms appended to it in the heading, while other lead terms such as "ANTIBIOTIC" have only one.

In a related system, described by Gupta (Gupta 1970), input strings consist of classification codes in the Colon Classification. The software analyzes each classification code and looks up its parts in machine-readable classification schedules. The results are strung together to form a description similar to a POPSI input string; this description is then cycled to produce a set of index strings. The index entries use the original classification codes as the locators. This system seems to epitomize string indexing's advantage of much index from little input; at the same time, heavy reliance is placed on the quality of the classification scheme, the indexer's knowledge of it, and its availability in machine-readable form.

2.3.5 CASIN

The CASIN (Computer Aided Subject INdex) system (Schneider 1976) is used for indexing Food Science and Technology Abstracts, though its general approach is intended to be applicable to other fields. A CASIN input string { 36} is divided into lines with each line beginning with a two-digit "category" code. Some category codes are followed by a second category code indicating a link to a previous category. The rest of a line consists usually of a term or phrase. A special symbol, represented as "//", either precedes the main word of a phrase or stands for the main word of the phrase belonging to the linked category.  Ditto marks ('"') indicate repetition from the previous line. Some abbreviations for terms are also employed, such as "c" for "patent". For example, the CASIN input string
21 manufacture of calories low // potato products
22 potato products use for calories low // snacks
23   "         "           "    "      "         "   // fish products
32 Belgium
41 c
51 21 Calories
52 22 Potatoes
53 23   "
71 21 Potatoes
72 22 //
73 23 //
refers to a "Belgian patent for a method of manufacturing low-calorie potato products, with mention of their possible use in low-calorie snacks and fish products". A number of rules govern the forms taken by the phrases in the first part of a CASIN input string; the somewhat odd expression "calories low" above is the result of the application of one of these rules.

Each CASIN index string consists of a heading and a subheading.  Index string generation begins with one of the categories with a code greater than 50. If there is a term for this category, the index string generator places this term in the heading. If ditto marks are found instead, the heading term is taken from the previous category. If the special symbol ("//") is found, the index string generator uses the main word of the phrase belonging to the linked category. To form the first part of the subheading, the index string generator takes the phrase belonging to the linked category. It inverts this phrase if the heading category code begins with "5"; otherwise, the order of the input string is retained.  Later parts of the subheading are taken from categories with codes in the 30's and 40's. The index strings from the sample input string are thus:

  1. Calories
         potato products, manufacture of calories low, Patent, Belgium
  2. Fish products
         potato products for calories low fish products, Patent, Belgium { 37}
  3. Potatoes
         fish products, potato products use for calories low, Patent, Belgium
  4. Potatoes
         manufacture of calories low potato products, Patent, Belgium
  5. Potatoes
         snacks, potato products use for calories low, Patent, Belgium
  6. Snacks
         potato products for calories low snacks, Patent, Belgium

2.3.6 KWIDR

In KWIDR (KeyWords In Defined Rotation), a system suggested by Ekern in Norway, a human indexer adds codes to ready-made bibliographic data such as titles. Specifically, the indexer places parentheses around any parts of the input string which the index string generator is to ignore, and a four-digit code after each access term. The first two digits of the four-digit code give the ordinal number of the term within the input string, while the last two give the total number of terms in the input string. An example of a short KWIDR input string is:
Wood, NW 0105 Abstracts 0205 (and their) indexes 0305 Aslib Proc 0405 18/1966/160-166 0505

KWIDR index strings are formed basically by cycling, but with the following variations: 1. the four-digit code for the lead term is inserted after the lead term; 2. the locator is inserted after the third term of the index string; 3. the index string terminates after four terms, except that the first term of the input string must be appended if it has not yet been included. The index entries for the sample input string, where the locator is "R1", are thus:

  1. 18/1966/160-166   0505   WOOD NW ABSTRACTS   R1 INDEXES
  2. ABSTRACTS   0205   INDEXES   ASLIB PROC R1 18/1966/160-166   WOOD NW
  3. ASLIB PROC   0405   18/1966/160-166   WOOD NW   R1 ABSTRACTS
  4. INDEXES   0305   ASLIB PROC 18/1966/160-166   R1   WOOD NW
  5. WOOD NW   0105   ABSTRACTS   INDEXES   R1 ASLIB PROC
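
The three variations on cycling can be expressed in a short Python sketch. It is a reconstruction from the rules and sample entries above, not Ekern's program; the function name is hypothetical, and the terms are supplied already normalized (upper case, parenthesized parts removed) rather than parsed from the raw input string.

```python
def kwidr_entries(coded_terms, locator):
    """Build KWIDR-style entries from (term, four_digit_code) pairs.

    Hypothetical reconstruction: each entry is returned as a list of parts.
    """
    first_term = coded_terms[0][0]
    entries = []
    for i, (lead, code) in enumerate(coded_terms):
        # Cycle the terms so that the lead term comes first.
        rotated = [term for term, _ in coded_terms[i:] + coded_terms[:i]]
        kept = rotated[:4]  # the index string stops after four terms
        # Code after the lead term; locator after the third term.
        entry = [lead, code] + kept[1:3] + [locator] + kept[3:]
        # The first term of the input string is appended if not yet included.
        if first_term not in kept:
            entry.append(first_term)
        entries.append(entry)
    return sorted(entries)


coded_terms = [
    ("WOOD NW", "0105"),
    ("ABSTRACTS", "0205"),
    ("INDEXES", "0305"),
    ("ASLIB PROC", "0405"),
    ("18/1966/160-166", "0505"),
]
for entry in kwidr_entries(coded_terms, "R1"):
    print(" ".join(entry))
```

Only the entry led by "ABSTRACTS" needs the first term appended, since every other rotation already includes "WOOD NW" among its first four terms.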

KWIDR is a somewhat mysterious system and the reasons for some of its features are obscure.

{ 38}

2.3.7 NEPHIS, LIPHIS, and NETPAD

I myself have designed three string indexing systems in attempts to overcome inadequacies of earlier systems. The first of these systems is NEPHIS (NEsted PHrase Indexing System) (Craven 1977). The emphasis in NEPHIS is on economy and on ease for the programmer, for the indexer, and for the searcher. A NEPHIS input string is a phrase in ordinary language with added coding symbols. Only four different coding symbols are used: the left and right angular brackets ("<", ">"), the question mark ("?"), and the at sign ("@"). The "<" and ">" mark the beginning and end of a phrase embedded, or "nested", within a larger phrase. The "?" indicates that what follows is a connective to be included only in those index strings in which the connective has something to which to connect. The "@" indicates that what follows is not an access term: this coding symbol is used at the beginning of the input string or at the beginning of a nested phrase, since these positions are where the NEPHIS index string generator normally recognizes an access term. For example, a NEPHIS input string describing an indexed item on "measures, from information theory, of the information content of document surrogates" is
@MEASURES? OF <INFORMATION CONTENT? OF <DOCUMENT SURROGATES>>? FROM <INFORMATION THEORY>

The NEPHIS index string generator creates an index string by beginning with the phrase associated with an access term. If this phrase is nested, there is one phrase in which it is most immediately nested. The index string generator basically appends to the index string the rest of this immediately larger phrase; a period plus space is inserted first unless the input string supplies a connective for this purpose. This appending process is repeated until the whole input string has been dealt with.  Some connectives and all coding symbols are omitted from the index string. Thus, from the input string above, the NEPHIS index string generator generates the index strings:

  1. DOCUMENT SURROGATES. INFORMATION CONTENT. MEASURES FROM INFORMATION THEORY
  2. INFORMATION CONTENT OF DOCUMENT SURROGATES. MEASURES FROM INFORMATION THEORY
  3. INFORMATION THEORY. MEASURES OF INFORMATION CONTENT OF DOCUMENT SURROGATES
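
The nesting and appending process can be made concrete with a simplified Python sketch. It is an illustration written for this passage, not the NEPHIS program itself: the class and function names are hypothetical, a "?" connective is assumed always to introduce the nested phrase that follows it, and connectives that link a nested phrase back to its enclosing phrase are not handled.

```python
import re


class Phrase:
    """A phrase of the input string, possibly nested in a parent phrase."""

    def __init__(self, parent=None):
        self.parent = parent
        self.access = True  # cleared when the phrase begins with "@"
        self.items = []     # plain text, ("?", connective), or nested Phrase


def parse(input_string):
    root = current = Phrase()
    pending_connective = False
    for token in re.findall(r"[<>?@]|[^<>?@]+", input_string):
        if token == "<":
            child = Phrase(parent=current)
            current.items.append(child)
            current = child
        elif token == ">":
            current = current.parent
        elif token == "@":
            current.access = False
        elif token == "?":
            pending_connective = True
        elif token.strip():
            text = token.strip()
            current.items.append(("?", text) if pending_connective else text)
            pending_connective = False
    return root


def render(phrase, skip=None):
    """Render a phrase, leaving out the nested phrase `skip` if given."""
    words = []
    for i, item in enumerate(phrase.items):
        if isinstance(item, Phrase):
            if item is not skip:
                words.append(render(item))
        elif isinstance(item, tuple):
            # Keep a "?" connective only if it still has something to introduce.
            following = phrase.items[i + 1] if i + 1 < len(phrase.items) else None
            if following is not None and following is not skip:
                words.append(item[1])
        else:
            words.append(item)
    return " ".join(words)


def index_strings(root):
    results = []

    def visit(phrase):
        if phrase.access:
            text, node = render(phrase), phrase
            # Append the rest of each immediately larger phrase in turn.
            while node.parent is not None:
                text += ". " + render(node.parent, skip=node)
                node = node.parent
            results.append(text)
        for item in phrase.items:
            if isinstance(item, Phrase):
                visit(item)

    visit(root)
    return sorted(results)


sample = ("@MEASURES? OF <INFORMATION CONTENT? OF <DOCUMENT SURROGATES>>"
          "? FROM <INFORMATION THEORY>")
for s in index_strings(parse(sample)):
    print(s)
```

Under these assumptions the sketch yields the three index strings listed above. The outermost phrase produces none because it begins with "@".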

The second system, LIPHIS (LInked PHrase Indexing System) (Craven 1978), was developed primarily to allow descriptions involving more complex structures of terms and links between terms; it also avoids the many brackets required in longer NEPHIS input strings. LIPHIS is characterized particularly { 39} by two common features of its input strings: 1. numeric codes to indicate places where the index string generator is to see links between terms which are separated in the input string; 2. the equal sign ("=") to indicate breaks in the sequence of links that the index string generator normally assumes between successive terms in the input string. Of the other coding symbols, the at sign ("@") is used, as in NEPHIS, before a non-access term, while the exclamation mark ("!") is used between words in multi-word terms. In addition, the index string generator recognizes an initial upper-case letter as indicating a term and an initial lower-case letter as indicating a connective. A LIPHIS input string for the item on "measures, from information theory, of the information content of document surrogates" is

@Measures 1 of Information!Content of Document!Surrogates = 1 from Information!Theory

LIPHIS divides an index string into a heading, consisting of the lead term, and a subheading. In creating an index string, the LIPHIS index string generator behaves generally quite like the hypothetical index string generator described in Chapter 1: from an initial term, it first follows as many links as possible forward, then follows a link backward, and repeats the process until there are no more links to follow. Thus, given that the sample LIPHIS input string represents the structure

MEASURES---from---*INFORMATION THEORY
|
of
|
*INFORMATION CONTENT
|
of
|
*DOCUMENT SURROGATES
it is not difficult to foresee that the index strings take the forms:
  1. Document Surrogates
         Information Content. Measures from Information Theory
  2. Information Content
         of Document Surrogates. Measures from Information Theory
  3. Information Theory
         Measures of Information Content of Document Surrogates

The third system, NETPAD (Craven 1982d, 1984; Declerck and Craven { 40} 1983), originated in work with graphical displays of the networks defined in Farradane's Relational Indexing, discussed below. There are two main versions of NETPAD, one written in MAXBASIC for the DECsystem-10 and the other in Commodore BASIC for the PET2001-8. The latter version is the more advanced and will be the one generally referred to in this book. A NETPAD input string has two parts, a table of terms and a table of links. Each term in the table of terms is numbered, and each element in the table of links consists of two term numbers and the number of the type of link.

Like MARC input strings, those for NETPAD cannot be displayed literally; they can, however, be reformatted by the NETPAD software into displays readable by human beings. In the display format most similar to the input string, each term in the term list appears, preceded by its number, on a separate line. The linktype table is presented in three columns with the center column containing mnemonic characters for the linktypes. The actual mnemonics used, as well as the linktypes themselves, are the choice of the user. For instance, an input string for the indexed item on "measures, from information theory, of the information content of document surrogates" can be presented as:

# Term
1MEASURES
2INFORMATION CONTENT
3DOCUMENT SURROGATES
4INFORMATION THEORY

# Linktype #
1(2
2(3
1_4

Here, the user has chosen the left parenthesis ("(") as a mnemonic symbol for an "of" type of link and the underscore ("_") for a "from" type of link. The more usual display format presents the structure of terms and term links graphically, using the same mnemonic symbols; e.g.,
MEASURES
|/(
|  INFORMATION CONTENT
|   /(
|     DOCUMENT SURROGATES
|
/_------INFORMATION THEORY

The basic NETPAD index string generation process is again similar to that of the hypothetical index string generator described in Chapter 1. In an important difference, however, the index string generator considers a link { 41} too "weak" to follow if the weight of the associated linktype falls below a "cutoff threshold". The threshold and the linktype weights are controllable by the user. The purpose of this sort of user control is specifically to allow customizing of index displays to suit specific search needs, especially in online systems. NETPAD users can also control what connectives represent what links in index strings.

Since it is somewhat misleading to try to illustrate NETPAD output with a single set of index strings, two sets, both derived from the input string displayed above, will be used instead:

  1. with prepositions as forward connectives, with the cutoff threshold set low, and with the weights of the "of" linktype greater than those of the "from" linktype,
    1. DOCUMENT SURROGATES . INFORMATION CONTENT . MEASURES from INFORMATION THEORY
    2. INFORMATION CONTENT of DOCUMENT SURROGATES . MEASURES from INFORMATION THEORY
    3. INFORMATION THEORY . MEASURES of INFORMATION CONTENT of DOCUMENT SURROGATES
    4. MEASURES of INFORMATION CONTENT of DOCUMENT SURROGATES from INFORMATION THEORY
  2. with dashes as forward connectives, with the cutoff threshold set higher, and with the weights of the "of" linktype less than those of the "from" linktype,
    1. DOCUMENT SURROGATES
    2. INFORMATION CONTENT - DOCUMENT SURROGATES
    3. INFORMATION THEORY
    4. MEASURES - INFORMATION THEORY - INFORMATION CONTENT - DOCUMENT SURROGATES

2.3.8 Relational Indexing

Farradane's Relational Indexing (Farradane 1980a, 1980b) is not primarily a string indexing system, but a general method of indexing items by means of networks of terms and links. Farradane sees it as most useful in computerized matching of descriptions to searchers' specifications. Nevertheless, the Relational Indexing string indexing software (Farradane { 42} 1978; Farradane and Gulutzan 1977) does lead to usable string index displays and has a number of interesting features.

An input string for Relational Indexing in general is divided into lines, and elements within each line are separated by semicolons (";"); each element begins with a code consisting of a letter followed by an equal sign ("="). As it is for NETPAD, the basic input string is a table of terms and a table of links. The code "s=" precedes each term in the term list; the code "w=" precedes each term number and "r=" each linktype number in the link table. Examples of linktype numbers are: 3 ("distinctness", "/)"), usually meaning "having as substitute"; 6 ("action", "/-"), meaning "affected by"; 7 ("association", "/;"), indicating various types of relationship; and 9 ("functional dependence", "/:"), meaning "yielding".

For the specific purposes of the index string generator, information added in the input string may include: whether terms represent processes ("v=2") or other entities ("v=1"); with which pairs of terms to begin index strings ("l=1"); how to express links in special cases ("p=" plus a connective); and special additional links ("a=" plus a special linktype number and a term number; "g=" plus a link number).

Using the Relational Indexing system to describe an item on "the use of measures from information theory and coding theory in the measurement of the semantic information content of documents and their surrogates", an indexer creates the input string

v=1;s=documents
v=1;s=surrogates
v=1;s=information content/semantic
v=2;s=measuring
v=1;s=measures
v=1;s=information theory
v=1;s=coding theory
l=1;w=1;r=3;p=of;l=1;w=2
l=1;w=1;a=12;r=7;l=1;w=3;g=4
l=1;w=2;a=11;r=7;w=3
g=2;w=3;r=6;w=4
w=4;r=7;p=for;w=5
l=1;w=6;a=17;p=derived;r=9;p=derived from;w=5
l=1;w=7;a=16;p=derived;r=9;p=derived from;w=5
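
How such an input string separates into a table of terms and a table of links can be sketched in Python. The sketch reads only the "s=", "w=", and "r=" codes described above and ignores the other codes; the function name is hypothetical, and the sketch assumes that every link line contains exactly two "w=" elements and one "r=" element, as in the sample.

```python
def parse_relational(input_string):
    """Return (terms, links) from a Relational Indexing style input string.

    Hypothetical helper: links are (term_number, linktype, term_number).
    """
    terms, links = [], []
    for line in input_string.strip().splitlines():
        elements = [element.split("=", 1) for element in line.strip().split(";")]
        codes = [code for code, _ in elements]
        if "s" in codes:
            # A term line: "s=" precedes the term itself.
            terms.append(dict(elements)["s"])
        else:
            # A link line: two term numbers ("w=") and a linktype ("r=").
            numbers = [int(value) for code, value in elements if code == "w"]
            linktype = next(int(value) for code, value in elements if code == "r")
            links.append((numbers[0], linktype, numbers[1]))
    return terms, links


sample = """
v=1;s=documents
v=1;s=surrogates
v=1;s=information content/semantic
v=2;s=measuring
v=1;s=measures
v=1;s=information theory
v=1;s=coding theory
l=1;w=1;r=3;p=of;l=1;w=2
l=1;w=1;a=12;r=7;l=1;w=3;g=4
l=1;w=2;a=11;r=7;w=3
g=2;w=3;r=6;w=4
w=4;r=7;p=for;w=5
l=1;w=6;a=17;p=derived;r=9;p=derived from;w=5
l=1;w=7;a=16;p=derived;r=9;p=derived from;w=5
"""
terms, links = parse_relational(sample)
for first, linktype, second in links:
    print(terms[first - 1], linktype, terms[second - 1])
```

Term numbers in the link table are taken to count the term lines from 1, so the first link joins "documents" and "surrogates" by linktype 3.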

An index string in this system has two parts: a one-term heading and a subheading. The index string generation process basically involves following links from term to term, beginning with an access term. Special rules must be invoked where more than one sequence of links could be followed; for instance, except for the fact that the link in which the lead term is marked is always followed first, special additional links take precedence over ordinary links. { 43} The index strings from the sample input string will serve as an illustration of the results obtained:

  1. Coding theory
         derived measures for measuring semantic information content of documents and surrogates. Information theory and -,
  2. Documents
         information content, semantic, measuring by measures derived from information theory and coding theory. Surrogates and -,
  3. Documents
         surrogates semantic information content measuring by measures derived from information theory and coding theory.
  4. Information content
         semantic, of documents and surrogates measuring by measures derived from information theory and coding theory.
  5. Information theory
         derived measures for measuring semantic information content of documents and surrogates. Coding theory and -,
  6. Surrogates
         of documents. Semantic information content measuring by measures derived from information theory and coding theory.
  7. Surrogates
         information content, semantic, measuring by measures derived from information theory and coding theory. Documents and -,

2.3.9 CIFT

CIFT (Contextual Indexing and Faceted Taxonomic Access System) is a special indexing system designed for the MLA International Bibliography for documents in language, literature, and folklore (Anderson 1979; Mackesy 1981; Modern Language Association 1981, 1982). A similar approach can, however, be applied to other areas such as art (Mutrux and Anderson 1983). Segments of CIFT input strings are divided by one or more spaces. Two types of codes are used: 1. a two-letter "facet" code, which, followed by a numeral and a slash ("/"), begins each segment; 2. a three-letter "role" code, which, preceded by a left angular bracket ("<"), is an abbreviation for a common connective phrase and precedes certain terms. Examples of CIFT { 44} facet codes are: "yl", "specific literatures"; "ta", "periods"; "ra", "individuals (real)"; "pa", "genres"; "na", "works"; "lk", "literary techniques"; "ka", "sources"; and "ha", "methodological approaches". "Role" codes include "uso", meaning "use of", and "soi", meaning "sources in". One or two asterisks ("*", "**") precede an access term, the double asterisk indicating that the term is a personal name. For example, a possible input string for an item taking "a linguistic approach to Virgilian sources for the use of hendiadys in Shakespeare's Hamlet" is:
yl1/English literature  ta1/1500-1599
ra1/Shakespeare, William  pa1/Tragedy
na1/Hamlet  lk1/<uso*Hendiadys
ka1/<soi**Virgil ha1/*Linguistic approach

As produced for the MLA International Bibliography, a CIFT index string has three parts: a heading, to be displayed in boldface capitals; a subheading, in mixed upper-and-lower-case boldface; and a subsubheading, in typefaces of ordinary weight. The index string generation rules are KWOC-like, with some exceptions, such as: certain parts of the input string are standardized automatically by reference to a thesaurus; the codes and abbreviations are translated to read more naturally; terms preceded by certain "role" codes, such as "soi", have some indication of these codes appended when they are lead terms; a lead term repeated in the subheading is capitalized; author dates appear only in the heading. For example, the index strings from the sample input string are:

  1. HENDIADYS
        English literature. Tragedy. 1500-1599.
            Shakespeare, William.  Hamlet. Use of HENDIADYS. Sources in Virgil. Linguistic approach.
  2. LINGUISTIC APPROACH
        English literature. Tragedy. 1500-1599.
            Shakespeare, William. Hamlet. Use of Hendiadys. Sources in Virgil. LINGUISTIC APPROACH.
  3. VIRGIL (70-19 B.C.) - AS SOURCE
        English literature. Tragedy. 1500-1599.
            Shakespeare, William. Hamlet. Use of Hendiadys. Sources in VIRGIL. Linguistic approach.
Access to the indexed item via "Shakespeare, William" is provided indirectly { 45} by a reference to the classified section of the bibliography; e.g.,
SHAKESPEARE, WILLIAM (1564-1616)
See also classified section: I 1831 ff.

2.3.10 The Iowa State University system

Mischo's system, developed at Iowa State University (Mischo 1979, 1980), is, like KWIDR, basically a system in which the indexer can add coding to existing bibliographic data or descriptions in ordinary language. It has been used, for example, both on titles and Library of Congress Subject Headings and on descriptive phrases. The indexer generally separates terms in the input string by one-character codes, which often have more than one function. For instance, both the octothorpe ("#") and the plus sign ("+") precede access terms. They also indicate what punctuation should separate the following term from the preceding term in index strings in which the terms are in the same order as in the input string: the octothorpe, a space only; the plus sign, a dash. An inequality sign ("≠") is occasionally used to make the following term a non-access term. Alternative portions are enclosed in parentheses ("(", ")") and separated by semicolons (";"). A fairly simple example of an input string is
GERMAN-ENGLISH # PHYSICS # DICTIONARY

The basic index string generation process in this system is that of cycling. When more than two terms are involved, however, two additional index strings are normally generated in the forms: "first term - last term - middle"; and "last term - middle - first term". Thus the input string above yields five index strings:

  1. DICTIONARY - GERMAN-ENGLISH PHYSICS
  2. DICTIONARY - PHYSICS - GERMAN-ENGLISH
  3. GERMAN-ENGLISH DICTIONARY - PHYSICS
  4. GERMAN-ENGLISH PHYSICS DICTIONARY
  5. PHYSICS DICTIONARY - GERMAN-ENGLISH
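
The cycling rule and its two additional forms can be sketched in Python. This is a reconstruction from the description and sample output above, not Mischo's software. The function names are hypothetical, every term is treated as an access term, and the punctuation rule is an inference from the sample: a term is joined by the punctuation of its own code when it follows a term that also preceded it in the input string, and by a dash otherwise.

```python
import re

# Punctuation implied by each code when input order is preserved.
SEPARATORS = {"#": " ", "+": " - "}


def parse_terms(input_string):
    parts = re.split(r"\s*([#+])\s*", input_string.strip())
    terms = parts[0::2]
    # joiners[i] is the punctuation that precedes term i in input order.
    joiners = [""] + [SEPARATORS[code] for code in parts[1::2]]
    return terms, joiners


def join_in_order(order, terms, joiners):
    text = terms[order[0]]
    for previous, current in zip(order, order[1:]):
        # Assumed rule: input-order pairs use the coded punctuation, others a dash.
        text += (joiners[current] if current > previous else " - ") + terms[current]
    return text


def index_strings(input_string):
    terms, joiners = parse_terms(input_string)
    n = len(terms)
    # Basic cycling: each term in turn becomes the lead.
    orders = [list(range(i, n)) + list(range(i)) for i in range(n)]
    if n > 2:
        middle = list(range(1, n - 1))
        orders.append([0, n - 1] + middle)   # first term - last term - middle
        orders.append([n - 1] + middle + [0])  # last term - middle - first term
    return sorted(join_in_order(order, terms, joiners) for order in orders)


for s in index_strings("GERMAN-ENGLISH # PHYSICS # DICTIONARY"):
    print(s)
```

For the three-term sample this gives the five index strings listed above; with only two terms, the two additional forms are not generated.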

The Iowa State University index string generator, like that for Double-KWIC, actually has two stages. The first stage produces one or more intermediate strings making explicit any alternatives indicated by parentheses and semicolons. For example, from the input string

(CIVIL ENGINEERING) # (TABLES; ≠ HANDBOOK)
describing a "handbook of civil engineering, with tables", the intermediate strings produced are:
  1. CIVIL ENGINEERING # TABLES
  2. CIVIL ENGINEERING HANDBOOK
{ 46} The second stage then uses the intermediate strings to generate the actual index strings; e.g.,
  1. CIVIL ENGINEERING TABLES
  2. CIVIL ENGINEERING HANDBOOK
  3. TABLES - CIVIL ENGINEERING
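
The first stage, in which alternatives are made explicit, amounts to taking one choice from each parenthesized group in every possible combination. A minimal Python sketch follows. The function name is hypothetical, and, unlike the intermediate strings shown above, the sketch simply leaves the codes (including "≠") in place for the second stage to interpret.

```python
import itertools
import re


def expand_alternatives(input_string):
    """Return one intermediate string per combination of alternatives."""
    # Odd-numbered pieces are the contents of parenthesized groups.
    pieces = re.split(r"\(([^)]*)\)", input_string)
    choices = []
    for i, piece in enumerate(pieces):
        if i % 2:
            # Alternatives within a group are separated by semicolons.
            choices.append([alt.strip() for alt in piece.split(";")])
        else:
            choices.append([piece])
    return ["".join(combo).strip() for combo in itertools.product(*choices)]


for s in expand_alternatives("(CIVIL ENGINEERING) # (TABLES; ≠ HANDBOOK)"):
    print(s)
```

The number of intermediate strings is the product of the numbers of alternatives in the groups, which is why input strings with long lists of alternatives yield so many index strings.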

Like PERMUTERM and SLIC, the Iowa State University system emphasizes number of index strings. For example, the input string

(ENERGY; COAL; PETROLEUM; NUCLEAR ENERGY; SOLAR ENERGY; GEOTHERMAL ENERGY; GASOLINE; HYDROPOWER; POWER ENGINEERING; INDUSTRIAL ENGINEERING) # (PHYSICAL CONSTANTS; ≠ HANDBOOK; DICTIONARY)
assigned to Energy technology handbook, yields no fewer than 50 index strings. On the other hand, the index strings are usually short. Thus, the original Mischo index to the Iowa State University Library Reference Collection truncates all index strings at 38 characters, but only about 5% lose any characters in this way.

2.3.11 PERMDEX

PERMDEX (Yerkey 1983) is a much simplified version of PRECIS written for the TRS80 model III microcomputer. Terms in the input string are followed by dollar signs ("$") and preceded by three-character codes.  Each three-character code consists of: 1. a role code; 2. a code indicating whether an access point follows ("1" if it does); and 3. a code indicating whether the term is to be included in the "display" line ("3" for "yes"). Although based on PRECIS, PERMDEX uses a rather different set of role codes. For instance, although "2" indicates the main action, as it does in PRECIS, it is "3" rather than "1" which indicates the object of the action; likewise, "4" rather than "6" indicates the form of the indexed item, and all adjectives are coded "M".  An example is the input string for "statistics on accidents involving small foreign automobiles":
213ACCIDENTS$M03SMALL$M13FOREIGN$
313AUTOMOBILES$413STATISTICS$
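
The layout of the three-character codes can be shown by a small Python sketch that decodes the sample input string. It covers only the coding described above, not Yerkey's index string generation; the function name and dictionary keys are hypothetical, and the two lines of the sample are treated as one continuous string.

```python
import re


def parse_permdex(input_string):
    """Decode 'RAD' + term + '$' units: role, access flag, display flag."""
    terms = []
    pattern = r"(.)(.)(.)([^$]*)\$"
    for role, access, display, term in re.findall(pattern, input_string):
        terms.append({
            "term": term,
            "role": role,                # e.g. "2" main action, "M" adjective
            "access": access == "1",     # "1" marks an access point
            "display": display == "3",   # "3" marks inclusion in the display
        })
    return terms


sample = "213ACCIDENTS$M03SMALL$M13FOREIGN$313AUTOMOBILES$413STATISTICS$"
for entry in parse_permdex(sample):
    print(entry)
```

In the sample, only "SMALL" carries a "0" in the second position and so is not coded as an access point.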

Unlike most string indexing systems, PERMDEX does not regularly put the locator at the end of the index entry, but inserts it between the heading and the subheading. For example, where the locator is "1234", the index entries produced from the input string above are:

  1. ACCIDENTS. 1234.
         SMALL FOREIGN AUTOMOBILES. STATISTICS. { 47}
  2. AUTOMOBILES (SMALL FOREIGN). ACCIDENTS. 1234.
         STATISTICS.
  3. FOREIGN AUTOMOBILES (SMALL). ACCIDENTS. 1234.
         STATISTICS.
  4. SMALL FOREIGN AUTOMOBILES. ACCIDENTS. 1234.
         STATISTICS.
  5. STATISTICS. AUTOMOBILES (SMALL FOREIGN). ACCIDENTS. 1234.

2.3.12 PASI

PASI (Pragmatic Approach to Subject Indexing) (Dutta and Sinha 1984) is a simple system developed in India for the Sorghum and Millets Information Center of the International Crops Research Institute for the Semi-Arid Tropics. PASI input strings are divided into segments by commas (","), and terms within multiterm segments are separated by colons (":") or slashes ("/"). Non-access terms are preceded by asterisks ("*"), and prepositions added for clarity are enclosed in parentheses ("(", ")"). For example, an article entitled "Mutagenic effects of combination treatments of hydrazine, ethyl methanesulphonate and gamma rays in Sorghum bicolor (L.) Moench" is assigned the input string
Sorghum bicolor, Mutation, Hydrazine: Ethyl methanesulphonate: Gamma rays, *Effect

The PASI index string generator produces the index strings basically by cycling the segments, with three variations: 1. no term preceded by an asterisk in the input string is made a lead term; 2. when a multiterm segment is cycled into the lead position, the terms within it are cycled in turn; 3. when a segment beginning with a parenthesized preposition is cycled into the lead position, the parenthesized preposition is omitted. Thus, the index strings generated from the input string above are:

  1. Ethyl methanesulphonate: Gamma rays: Hydrazine,
         Effect; Sorghum bicolor, Mutation,
  2. Gamma rays: Hydrazine: Ethyl methanesulphonate,
         Effect; Sorghum bicolor, Mutation,
  3. Hydrazine: Ethyl methanesulphonate: Gamma rays,
         Effect; Sorghum bicolor, Mutation,
  4. Mutation,
         Ethyl methanesulphonate: Gamma rays: Hydrazine, Effect; Sorghum bicolor { 48}
  5. Sorghum bicolor,
         Mutation; Ethyl methanesulphonate: Gamma rays: Hydrazine, Effect

2.3.13 The NILS system

A simple string indexing system has been developed by the National Insurance Law Service as part of a larger project to computerize its database. A NILS input string is divided into segments by dashes (" - "). The first segment is always a numeral specifying a particular procedure for index string generation, and the remaining two or three segments are always terms. A "1" or a "5" in the initial segment indicates that only the first term is an access term; a "2" or a "4", that the first two terms are access terms; a "3", that all three terms are access terms; and a "6", that the first and third terms are access terms. For example, an article relating to "costs of, and reserves needed for, self insurance for workers compensation" is assigned the input strings
3 - WORKERS COMPENSATION INSURANCE - SELF INSURANCE - COSTS
and
5 - RESERVES - WORKERS COMPENSATION - SELF INSURANCE

The basic rule for index string generation in the NILS system is to make every possible permutation of the access terms in the input string while leaving non-access terms unmoved. Thus, the index strings resulting from the first input string above are:

  1. COSTS - SELF INSURANCE - WORKERS COMPENSATION INSURANCE
  2. COSTS - WORKERS COMPENSATION INSURANCE - SELF INSURANCE
  3. SELF INSURANCE - COSTS - WORKERS COMPENSATION INSURANCE
  4. SELF INSURANCE - WORKERS COMPENSATION INSURANCE - COSTS
  5. WORKERS COMPENSATION INSURANCE - COSTS - SELF INSURANCE
  6. WORKERS COMPENSATION INSURANCE - SELF INSURANCE - COSTS
The "5" code has the effect of producing an additional index string with the order of the two non-access terms reversed. Thus, the index strings resulting from the second input string above are: { 49}
  1. RESERVES - SELF INSURANCE - WORKERS COMPENSATION
  2. RESERVES - WORKERS COMPENSATION - SELF INSURANCE
Only index string 2 would be produced if the code were "1".
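
The permutation rule, including the effect of the "5" code, can be sketched in Python. This is a reconstruction from the description above rather than the NILS software: the function name is hypothetical, and since the text does not say how "2" differs from "4", the sketch treats each such pair of codes alike except for the extra reversal under "5".

```python
from itertools import permutations

# Positions (0-based) of the access terms for each procedure code.
ACCESS_POSITIONS = {
    "1": [0], "5": [0],
    "2": [0, 1], "4": [0, 1],
    "3": [0, 1, 2],
    "6": [0, 2],
}


def nils_index_strings(input_string):
    code, *terms = input_string.split(" - ")
    access = [p for p in ACCESS_POSITIONS[code] if p < len(terms)]
    fixed = [p for p in range(len(terms)) if p not in access]

    arrangements = []
    for perm in permutations(access):
        # Permute the access terms among their own slots only.
        order = list(range(len(terms)))
        for slot, source in zip(access, perm):
            order[slot] = source
        arrangements.append(order)

    if code == "5" and len(fixed) == 2:
        # Code "5": add a string with the two non-access terms reversed.
        a, b = fixed
        for order in list(arrangements):
            swapped = order[:]
            swapped[a], swapped[b] = order[b], order[a]
            arrangements.append(swapped)

    return sorted(" - ".join(terms[i] for i in order) for order in arrangements)


for s in nils_index_strings("5 - RESERVES - WORKERS COMPENSATION - SELF INSURANCE"):
    print(s)
```

Applied to the first sample input string, with code "3", the same function returns all six permutations of the three terms.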

2.4 NEAR-STRING-INDEXING SYSTEMS

Many indexing systems are not string indexing systems, and not all of them can be discussed here. This section is devoted to a very few index display methods which are especially similar to string indexing and which have had a significant influence on the design of string indexing systems.

2.4.1 Title catchword systems

Long before true, if relatively simple, string indexing systems such as KWIC and KWOC, indexes were being produced by manually manipulating the titles of the indexed items in various ways.  An early example is provided by the index to Watt's Bibliotheca Britannica (Watt 1824), which rather resembles a KWOC index. The major parts of an entry are: a lead term, which is most often a keyword from the title; the date; the title, with the repeated lead term abbreviated; and the locator. For example, entries for the title "Peace, Ignominy, and Destruction; a Poem" are:
  1. DESTRUCTION. - 1796. Peace, Ignominy, and D.; a Poem. 546 o
  2. IGNOMINY. - 1796. Peace, I., and Destruction; a Poem. 546 o
  3. PEACE. - 1796. P., Ignominy, and Destruction; a Poem. 546 o
Cross-references and notes are also included; e.g.,
PEACE, rest, quiet

2.4.2 Cross-indexing

Many essentially manually produced indexes, especially back-of-book indexes, show the multiple overlapping entries characteristic of string indexes. Again, however, the entries are not generated by computer software according to explicit syntactic rules. Indeed, inconsistencies can often be noted. For { 50} example, in one such index (Allwood and others 1977, pp. 181-185), the index entry
logic: epistemic 112
has a slightly different locator from its complement
epistemic logic 112f
Similarly, the index entries
modal predicate logic 110
and
predicate logic: modal 110
are not properly complemented by the index entries starting with "logic"
logic: modal 108f
and
logic: predicate 58f, 148f
which give no indication to a searcher that information on a kind of predicate logic is available on page 110.

Multiple overlapping entries are often possible in using lists of subject headings such as the Library of Congress Subject Headings. Again, their generation is not by computer software. Indexers may, however, sometimes be guided to generate multiple overlapping entries according to explicit rules. For example, in the Library of Congress system, an item assigned a subject heading in the form "Country 1 - Relations (Military) with Country 2" must also be assigned one in the form "Country 2 - Relations (Military) with Country 1"; e.g.,

  1. Salvador - Relations (Military) with United States
  2. United States - Relations (Military) with Salvador
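
The reciprocal-heading rule just quoted is explicit enough to be written down as a procedure, even though in practice it is applied by the indexer rather than by software. A minimal sketch, in which the function name `reciprocal_headings` and its parameters are hypothetical:

```python
def reciprocal_headings(country_1, country_2, relation="Relations (Military)"):
    """Return both headings required for a pair of countries.

    The heading pattern is copied from the example in the text; the
    function itself is an illustration, not part of any Library of
    Congress software.
    """
    pairs = [(country_1, country_2), (country_2, country_1)]
    return sorted(f"{a} - {relation} with {b}" for a, b in pairs)

headings = reciprocal_headings("United States", "Salvador")
for heading in headings:
    print(heading)
```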

2.4.3 Kaiser's Systematic Indexing

If it were computerized, Kaiser's Systematic Indexing (Kaiser 1911) could probably be considered the earliest clear example of string indexing. As it is, it was designed and used for manual card indexing of business information. Its syntactic rules, however, are quite explicit and are designed for multiple overlapping entries. { 51}

The first part of a Systematic Indexing index entry, the "statement", consists usually of two to three terms. The second part, the "amplification", can be quite long, containing a sort of abstract, or "extension", as well as: date of information; author(s); name of publication, place and date, pagination, edition; and a locator in the form of a call number. A sample "statement" and "extension", for an item on "60-80% increase of paper prices in India due to scarcity", are

(statement) PAPER
    INDIA
        DEMAND
(extension) Prices have advanced 60-80% owing to scarcity.

A statement must contain one "process" term; in addition, it must contain either a "concrete" term or one or two "country" (geographical location) terms, or both. Concrete terms and country terms are always access terms; process terms never are. For example, an item on "the export of electric traction motors from Italy to France" is assigned: the process term "EXPORT TRADE"; the concrete term "ELECTRIC TRACTION MOTOR"; and the country terms "ITALY" and "FRANCE". The statements in the corresponding entries are:

  1. ELECTRIC TRACTION MOTOR
        ITALY-FRANCE
            EXPORT TRADE
  2. FRANCE-ITALY
        ELECTRIC TRACTION MOTOR
            EXPORT TRADE
  3. ITALY-FRANCE
        ELECTRIC TRACTION MOTOR
            EXPORT TRADE
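
Because Kaiser's rules are explicit, the statements above can be reproduced by a short procedure. The sketch below is an illustration only: the function `kaiser_statements` is hypothetical, and its treatment of cases not shown in the example (a single country term, or no concrete term) is an assumption generalized from the one sample, not something taken from Kaiser.

```python
def kaiser_statements(process, concrete=None, countries=()):
    """Generate the statements for one item (hypothetical sketch).

    Concrete and country terms act as lead (access) terms; the
    process term always comes last and never leads.
    """
    if concrete is None and not countries:
        raise ValueError("a concrete term or a country term is required")
    if len(countries) > 2:
        raise ValueError("at most two country terms are allowed")

    # Two country terms are combined in both orders, as in the example.
    country_orders = []
    if countries:
        country_orders.append("-".join(countries))
        if len(countries) == 2:
            country_orders.append("-".join(reversed(countries)))

    statements = []
    if concrete:
        # Statement led by the concrete term (assumed to keep the given order).
        statements.append([concrete] + country_orders[:1] + [process])
    for country_term in country_orders:
        # Statements led by each ordering of the country terms.
        middle = [concrete] if concrete else []
        statements.append([country_term] + middle + [process])
    return sorted(statements)

motor = kaiser_statements(
    "EXPORT TRADE",
    concrete="ELECTRIC TRACTION MOTOR",
    countries=("ITALY", "FRANCE"),
)
for statement in motor:
    # Print each statement with the stepped indentation used above.
    for depth, term in enumerate(statement):
        print("    " * depth + term)
```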

2.4.4 Unit card systems

The popularity of unit card systems dates from the issuing of printed catalog cards by the Library of Congress at the beginning of the century. They are clearly the precomputer forerunners of the automated library catalog systems already described. A master, or "main entry", card is the equivalent of the input string. An unmodified copy of the master card becomes the "main" entry. Part of the information on the master card is the "tracings", which constitute instructions on what "added" entries to make. A tracing may either be an access term or refer, by a name such as "title" or "series", to an access term elsewhere on the card. An added entry is constructed by copying the master card and typing, at the top of the copy, the access term indicated by the tracing.
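
The unit card procedure can be sketched in a few lines. Everything named here is hypothetical: the card is modeled as a dictionary, the field names and sample data are invented purely for illustration, and a tracing is assumed to name a field whenever it matches one of the card's field names.

```python
def unit_card_entries(card, tracings):
    """Return (heading, card) pairs: the main entry plus added entries.

    A heading of None marks the unmodified main entry. A tracing that
    names a field of the card (e.g. "title") is replaced by the access
    term found in that field; any other tracing is itself the access term.
    """
    entries = [(None, card)]
    for tracing in tracings:
        heading = card.get(tracing, tracing)
        # Each added entry is a copy of the master card under a new heading.
        entries.append((heading, dict(card)))
    return entries

# Invented sample data, for illustration only.
master = {"author": "Doe, Jane", "title": "An Example Title",
          "series": "Example Series"}
cards = unit_card_entries(master, ["Indexing", "title", "series"])
for heading, card in cards:
    print(heading or "(main entry)", "|", card["author"], "|", card["title"])
```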
{ 52}

2.4.5 The Universal Decimal Classification

The Universal Decimal Classification is rather unusual among traditional, library-style classification schemes in that it is explicitly designed for multiple index entry generation. It does not, however, actually include a specific index string generator. Either cycling or a modified KWOC manipulation is suggested by the documentation (Mills 1963, pp. 44-45). The terms in this system are, of course, codes rather than words or phrases in ordinary language.

2.4.6 Chain procedures

Chain indexing (Ranganathan 1964, pp. 279-326), though based in part on the "Relativ Index" to the Dewey Decimal Classification, was originated as a systematic procedure by Ranganathan. In its original form, it requires that the locators in the index point to locations in a classified sequence. The term "chain indexing" comes from the use of the term "chain" to mean a sequence of classes each of which includes any which follow it: the classes to which a classified sequence assigns an item necessarily form such a chain.  For example, an item consisting of "statistics on rural education in India in the 1930's" is assigned, in the Colon Classification, to the smallest class "T9(Y31).44'N3s" and hence by implication to each of the classes in the chain:
T education
T9(Y31) rural education
T9(Y31).4 rural education in Asia
T9(Y31).44 rural education in India
T9(Y31).44'N3 rural education in India in the 1930's
T9(Y31).44'N3s statistics on rural education in India in the 1930's.

The compiler of a chain index sees to it that there are index entries for various broader and narrower classes to which the classification scheme assigns each item in the sequence. A chain index avoids waste by starting the index entry for each class in a chain with a different term; for this reason, some classes in the chain may in fact have no index entries.  For example, a chain index to a sequence classified by the Colon Classification would contain the following entries for the sample item above:

  1. ASIA, RURAL, EDUCATION ...T9(Y31).4
  2. EDUCATION ...T
  3. INDIA, RURAL, EDUCATION ...T9(Y31).44 { 53}
  4. RURAL, EDUCATION ...T9(Y31)
  5. STATISTICS, INDIA, RURAL, EDUCATION ...T9(Y31).44'N3s
The chain indexer makes no entry for the class "T9(Y31).44'N3", "rural education in India in the 1930's": "1930's" is assumed not to be a useful access term, and other access terms are already covered by other entries.

The most common procedure in chain index entry generation, exemplified by all but the first of the index entries above, is quite similar to that for TABLEDEX: starting with an ordered list of terms and successively removing terms from the beginning until no terms remain. Chain indexing nevertheless shows two important differences: 1. the terms are ordered, not alphabetically, but in an order corresponding to the reverse of the chain; 2. the locator is changed with each term dropped from the term list.
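
The successive-removal procedure, with its changing locator, can be sketched as follows. This is an illustration only, since traditional chain indexing is a manual procedure; the function `chain_entries` is hypothetical, and the choice of which links of the chain supply access terms is assumed to have been made by the indexer beforehand.

```python
def chain_entries(links):
    """Generate chain index entries by successive removal.

    `links` is a list of (term, class_number) pairs in the reverse of
    chain order, i.e. narrowest class first. Dropping a term from the
    front also changes the locator to the class of the new first term.
    """
    entries = []
    for start in range(len(links)):
        terms = [term for term, _ in links[start:]]
        locator = links[start][1]
        entries.append(f"{', '.join(terms)} ...{locator}")
    return sorted(entries)

chain = chain_entries([
    ("STATISTICS", "T9(Y31).44'N3s"),
    ("INDIA", "T9(Y31).44"),
    ("RURAL", "T9(Y31)"),
    ("EDUCATION", "T"),
])
for entry in chain:
    print(entry)
```

This yields the second to fifth of the sample entries; the entry under "ASIA" does not arise from this sequence of removals and would have to be produced separately.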

A newly classified document may result in several new index entries in a traditional chain index; these index entries do not, however, point directly to the document itself, but each to a different class of documents. Moreover, entry generation is not an automated process. Thus, traditional chain indexing is also not a form of string indexing according to the definition given in this book. Its importance here lies in its influence on the design of POPSI and PRECIS, both directly and through Coates' adaptation.

Coates' chain procedure (Coates 1960, 1969; Coates and Nicholson 1967), adopted for British Technology Index, the predecessor of Current Technology Index, is based on that of Ranganathan. But Coates modified Ranganathan's method for use without a classification scheme or a classified sequence. Coates' work is especially important because of its explicit rules and because of its computerization. As in string indexing, the indexer constructs input strings which the software uses to construct the index. Unlike in string indexing, however, only one index entry normally results from each input string, though one or more cross-references are also produced. For example, for an item on "the turbulent flow of water in pipes", the description part of the single index entry is

WATER:Flow,Turbulent:Pipes
From the same input string, however, the software also produces the cross-references
  1. FLOW,Turbulent:Water See WATER:Flow,Turbulent
  2. PIPES:Turbulent flow:Water See WATER:Flow,Turbulent:Pipes
  3. TURBULENT FLOW:Water See WATER:Flow,Turbulent
{ 54}

2.4.7 Universal Index Entry Generator

Keen's Universal Index Entry Generator (TOPSI-UNIV) (Armstrong and others 1983) is a demonstration index string generator associated with the Teaching Of Printed Subject Indexes (TOPSI) project.  It is included here rather than in the previous section because it is not clear that it represents a complete string indexing system.

The TOPSI-UNIV indexer first types in every word or number that is to appear in any index entry for the indexed item. The TOPSI-UNIV program assigns a one-letter code to each word or number typed in. The indexer then types in a specification for each index string to be produced for the indexed item. Each specification is identical with the corresponding index string except that: 1. each word or number is represented by its one-letter code; 2. the beginning of a subheading is indicated by a plus sign ("+"). Finally, the indexer types "GO" to indicate the end of the input string.

TOPSI-UNIV can imitate the index strings of almost any string indexing system. For example, an indexer wishing to obtain PRECIS-like index strings for the item on "applying information theory to measuring the information in document surrogates" could first type

APPLICATIONS OF INFORMATION THEORY IN MEASUREMENT OF INFORMATION CONTENT OF DOCUMENT SURROGATES
The TOPSI-UNIV program would code this phrase as
AAPPLICATIONS BOF CINFORMATION DTHEORY EIN FMEASUREMENT GOF HINFORMATION ICONTENT JOF KDOCUMENT LSURROGATES
and the indexer could then enter the specifications
KL+HI. F. ABCD
HI. KL+F. ABCD
CD+AEFGHIJKL
GO
The resulting index strings would be:
  1. DOCUMENT SURROGATES
      INFORMATION CONTENT. MEASUREMENT. APPLICATIONS OF INFORMATION THEORY
  2. INFORMATION CONTENT. DOCUMENT SURROGATES
      MEASUREMENT. APPLICATIONS OF INFORMATION THEORY { 55}
  3. INFORMATION THEORY
      APPLICATIONS IN MEASUREMENT OF INFORMATION CONTENT OF DOCUMENT SURROGATES
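
The coding and expansion steps described above can be imitated in a short sketch. It is not the TOPSI-UNIV program itself: the function names are hypothetical, and the handling of spacing and punctuation is an assumption chosen so that the sample specifications yield the sample index strings. In particular, a punctuation mark is attached to the preceding word, and "GO" is always treated as the terminator, which would be ambiguous if the letters G and O could also occur together as codes in a specification.

```python
import string

def assign_codes(phrase):
    """Assign the letters A, B, C, ... to the words of the phrase."""
    words = phrase.split()
    if len(words) > len(string.ascii_uppercase):
        raise ValueError("this sketch supports at most 26 words")
    return dict(zip(string.ascii_uppercase, words))

def expand_part(part, codes):
    """Expand one heading or subheading specification."""
    out = []
    for ch in part:
        if ch in codes:
            out.append(codes[ch])   # a code letter stands for its word
        elif ch.isspace():
            continue                # words are re-spaced on output
        elif out:
            out[-1] += ch           # punctuation attaches to the previous word
        else:
            out.append(ch)
    return " ".join(out)

def generate(phrase, specifications):
    """Return one [heading, subheading, ...] list per specification."""
    codes = assign_codes(phrase)
    index_strings = []
    for spec in specifications:
        if spec == "GO":            # end of the input string
            break
        # A plus sign marks the beginning of a subheading.
        index_strings.append([expand_part(p, codes) for p in spec.split("+")])
    return index_strings

phrase = ("APPLICATIONS OF INFORMATION THEORY IN MEASUREMENT OF "
          "INFORMATION CONTENT OF DOCUMENT SURROGATES")
result = generate(phrase, ["KL+HI. F. ABCD", "HI. KL+F. ABCD",
                           "CD+AEFGHIJKL", "GO"])
for heading, subheading in result:
    print(heading)
    print("    " + subheading)
```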

Chapter 2 Summary

String indexing systems can be divided roughly into three categories according to type of input string: phrases in ordinary language such as titles; simple lists of terms; and strings containing additional codes as instructions to the software.

String indexing systems with ordinary-language input strings include cycling and KWIC systems, KWOC systems, PANDEX, PERMUTERM, Double-KWIC, ASI, and KWPSI. A cycled index string consists of a lead term, the part of the input string following the lead term, a dividing symbol, and the part of the input string preceding the lead term; KWIC is a common way of displaying the output of the cycling process. A KWOC index string consists of a lead term plus the unmodified input string; in a variant, an omission symbol is substituted for the repeated lead term. PANDEX is an elaboration of KWOC. In PERMUTERM, each index string consists of two terms, and a number of index strings from one input string may share the same lead term. The latter is also true of Double-KWIC, which is a combination of cycling and variant KWOC. ASI and KWPSI represent somewhat more complex manipulation of input strings, involving segmenting at different levels.
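
As a reminder of the two basic manipulations summarized above, the following sketch builds a cycled index string and a KWOC index string from a list of words. The function names, the dividing symbol "/", and the omission symbol "*" are assumptions made for illustration; actual systems differ in their symbols and display conventions.

```python
def cycled_string(words, i, divider="/"):
    """Lead term, what follows it, a dividing symbol, what precedes it."""
    return " ".join([words[i], *words[i + 1:], divider, *words[:i]])

def kwoc_string(words, i, omission=None):
    """Lead term plus the input string.

    In the variant form, the repeated lead term inside the string is
    replaced by an omission symbol.
    """
    body = list(words)
    if omission is not None:
        body[i] = omission
    return words[i], " ".join(body)

title = "SURVEY OF STRING INDEXING SYSTEMS".split()
print(cycled_string(title, 3))
print(kwoc_string(title, 3))
print(kwoc_string(title, 3, omission="*"))
```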

String indexing systems with term-list input strings include the CLASE system, ABC-Spindex, TABLEDEX, SLIC, and, somewhat marginally, MULTITERM. The CLASE system is much like PERMUTERM. ABC-Spindex, TABLEDEX, and SLIC all alphabetize the input list of terms; index string generation is KWOC-like in ABC-Spindex and involves selective term omission in TABLEDEX and SLIC. MULTITERM appends role codes to terms and has a cycling type of index string generator.

String indexing systems with coded input strings include Statement Indexing, automated library catalog display systems, PRECIS, POPSI, CASIN, KWIDR, NEPHIS, LIPHIS, NETPAD, Relational Indexing, CIFT, the Iowa State University system, PERMDEX, PASI, and the NILS system. Statement Indexing is an early proposal involving a somewhat KWOC-like index string generator. Automated library catalog display systems are included here for completeness but cannot be discussed in detail. The well-known PRECIS system requires a nine-character code for each segment of the input string; several procedures are applied in index string generation, of which { 56} the chief is shunting. POPSI employs a simpler form of input string coding; the main procedure in index string generation is KWOC-like, but with omissions and with one or two terms inserted after the lead term. CASIN specifies access terms separately from other information in the input string; an access term is associated with a phrase earlier in the input string that will be used to generate the beginning of the subheading. KWIDR is a rather mysterious, though simple, system in which coding is added to ready-made data. NEPHIS makes use of the idea of the nesting of phrases in the input string, producing index strings similar to those from ASI, but with fewer connectives. LIPHIS, NETPAD, and Relational Indexing represent moves toward more explicit representation of networks of term links. CIFT is specially designed for documents on language, literature, and folklore; index string generation is KWOC-like. The Iowa State University system emphasizes the production of many relatively short index strings. PERMDEX, PASI, and the NILS system are all quite simple: PERMDEX is derived from PRECIS; PASI uses cycling rather than shunting.

A few indexing methods very close to string indexing are manual title catchword systems, various forms of manual cross-indexing, Kaiser's Systematic Indexing, unit card systems, the Universal Decimal Classification, chain procedures, and the Universal Index Entry Generator. Manual systems often show various degrees of inconsistency. Kaiser's system, on the other hand, is a string indexing system in all but computerization. Unit card systems are the ancestors of automated library catalog display systems. The Universal Decimal Classification and the Universal Index Entry Generator are pieces of string indexing systems rather than wholes. The chain procedures of Ranganathan and Coates are important influences on the later development of string indexing systems.
