String indexing systems will be divided roughly into three categories according to type of input string: phrases in ordinary language; simple lists of terms; and strings containing additional codes as instructions to the software. This division has been possible because, though string index generation without input strings can be envisioned, all present-day string indexing systems do have input strings.
A final section will give some background on other indexing methods which do not quite fit the definition of string indexing used in this book. These close relatives of string indexing are included here for purposes of comparison and because of their influence on the design of systems which do fit the definition.
The software of these ordinary-language systems analyzes input strings more or less crudely into their components. It also more or less recognizes the components as connectives, access terms, and so on. In doing so, it may make reference to such characteristics as their length or to various types of ancillary input, such as a stoplist or a golist. A stoplist is a list of terms which cannot be access terms; a golist is a list of terms which should be access terms. The access terms recognized are usually individual words, and these words are referred to as keywords.
EXTRUSION TEXTURIZING OF SOY MEAL
a typical cycled index generator produces the four index strings:
The syntactic rules that define what index strings the cycled index string generator produces can be stated very briefly. A cycled index string consists of a keyword from the input string, the part of the input string following the keyword, a dividing symbol, and the part of the input string preceding the keyword. If no part of the input string happens to follow the keyword, as in index string 1 above, or to precede the keyword, as in 2, the index string will of course look somewhat simpler, since one part of it will be empty.
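The rule just stated can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the use of " / " as the dividing symbol, the one-word stoplist, and the omission of the dividing symbol when nothing precedes the keyword are all choices made here, not features of any particular cycled index generator.

```python
def cycled_index_strings(input_string, stoplist=("OF",), divider=" / "):
    """Return one cycled index string per keyword of input_string."""
    words = input_string.split()
    stop = {w.upper() for w in stoplist}
    results = []
    for i, word in enumerate(words):
        if word.upper() in stop:
            continue  # stoplisted words cannot be access terms
        lead = " ".join(words[i:])   # the keyword plus the part following it
        rest = " ".join(words[:i])   # the part preceding the keyword
        # Assumption: the divider is dropped when nothing precedes the keyword.
        results.append(lead + divider + rest if rest else lead)
    return results


strings = cycled_index_strings("EXTRUSION TEXTURIZING OF SOY MEAL")
for s in strings:
    print(s)
```

With "OF" on the stoplist, the sample input string yields four index strings, one for each remaining word.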
In classic KWIC, the basic output of the cycling process is displayed so that the part of the title preceding the keyword continues to precede the keyword in the displayed index entry. In the simplest form, parts of the index string which would not fit onto a single line are omitted. Thus, the sample index strings might appear as:
Truncation is not usually a severe problem in practical KWIC index displays, where the space allowed for an index string is normally at least 60 characters, and often over 100. Allowing too much space for KWIC index string display creates its own problems, however, those of too much empty, or "white", space.
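The simplest classic KWIC display, with truncation, might be sketched as follows. The column widths, the stoplist, and the function name are illustrative assumptions; the very narrow widths are chosen only to make the truncation visible.

```python
def kwic_lines(input_string, stoplist=("OF",), left_width=12, right_width=24):
    """Lay out one fixed-width KWIC line per keyword, truncating overflow."""
    words = input_string.split()
    stop = {w.upper() for w in stoplist}
    lines = []
    for i, word in enumerate(words):
        if word.upper() in stop:
            continue
        before = " ".join(words[:i])   # continues to precede the keyword
        after = " ".join(words[i:])    # the keyword and what follows it
        left = before[-left_width:].rjust(left_width)   # keep the nearest context
        right = after[:right_width].ljust(right_width)  # omit what does not fit
        lines.append(left + " " + right)
    return lines


for line in kwic_lines("EXTRUSION TEXTURIZING OF SOY MEAL"):
    print(line)
```

Every line has the same length, so the keywords file in a single column; widening `left_width` and `right_width` reduces truncation at the cost of more white space.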
The first disadvantage of excessive white space is of course a bulkier index. A second problem, however, is created because the locators in a KWIC index generally appear in a separate column on the righthand side of the display. If there is too much empty space between the ends of the index strings and the corresponding locators, searchers may find it quite difficult to match the correct locator with an index string (Fischer 1966).
One response to the white-space and truncation problems has been to "wrap around" the KWIC index strings. While there are different versions of wrap-around KWIC, a typical effect, provided the input string is not too long, is of cycled index strings with their last parts chopped off and reattached in front; e.g.,
Another solution to the locator-matching problem, the filling out of the empty spaces with dots, does not reduce the bulk of the index (Chernyi and others 1969). More recent suggested variants allow multi-line index strings, as in the KWIC-style experimental library catalog tested at the Bath University Centre for Catalogue Research (Prowse 1983, pp. 7-9).
In a variation of KWOC, a mark of omission (for example, an asterisk, "*") is substituted for the repeated access term:
Truncation is unusual in KWOC systems, and the index string is often displayed over several lines. One exception is a system used by the MITRE Corporation, which truncates both longer lead terms and titles (Feinberg 1973, pp. 119-122).
The results of the two features can be seen in the following short extract from a PANDEX index (Feinberg 1973, p. 145), which also truncates the subheading part of each index string to fit a standard column size:
ADSORPTION
of polyglutamic acid adsorbed on char
changes induced by alkali adsorption
methods. IV. Adsorption of aromatic
Adsorption characteristics of water so-
Chromatographic adsorption constants
in adsorption chromatography on alum-
A later modification (Lay 1973) allows the index producer to specify a mixture of KWOC and Double-KWIC. Double-KWIC is followed for access terms which apply to more than a given number of indexed items, while for rarer lead terms KWOC is adopted.
Relatively simple input strings can be segmented in ASI in only one way; for example, "EXTRUSION TEXTURIZING OF SOY MEAL" can be segmented only as follows:
EXTRUSION TEXTURIZING OF SOY MEAL
Such input strings thus always yield the same index strings. For more complex input strings, the ASI program must choose which of several "variant" index strings to produce for a given access term; the basis for the choice is how many index strings beginning in the same way can be produced for other indexed items. The aim is to improve collocation.
Take, for example, the subject indexes to CLASE (Citas Latinoamericanas en Sociología, Economía y Humanidades). These indexes are produced by a PERMUTERM-like index string generator from lists of subject headings assigned to each indexed item. Because, however, the subject headings are quite often two or three words long, more space is allowed for them in the index strings than is allowed for terms in ISI's PERMUTERM; thus, truncation is avoided. For instance, an article entitled "La Administración de las Provincias Senatoriales Romanas", assigned the term list
A list of terms can be given a simple, easily predicted order by being alphabetized. The ABC-Spindex system (Falk and Baser 1980), by which ABC-Clio indexes America: History and Life, uses such a list of alphabetized keywords, applying a modified KWOC procedure to produce the index strings. Because the terms are alphabetized and unconnected, no symbol of omission is required. For example, an indexed item dealing with "Thomas Allen and the American Revolution in New England from 1774 to 1777" is assigned the term list
Two systems, TABLEDEX (Ledley 1958) and SLIC (Sharp 1966), retain strict alphabetical order in every index string by limiting the terms which follow the lead term to ones which file after it.
The TABLEDEX index string generator produces one index string for each access term in the input string, omitting only the terms which precede the access term. Thus, an article assigned the term list
The SLIC (Selected Listing In Combination) string index generator, by contrast, like PERMUTERM and Double-KWIC, routinely produces more than one index string for a single access term in the input string. For example, the index strings produced for the term list
MULTITERM (Skolnik 1970, 1972) is also discussed in this section, even though it is somewhat difficult to view the elements of its input strings as consisting just of terms. Each input string element does consist mostly of a term of one or more words; but, in addition, this term may have appended to it a space-plus-hyphen (" -") followed by a one- or two-letter role code. In general, a role code is a code indicating how a term is related to the rest of a description. Examples in MULTITERM are: "Q", indicating that the term names a "property" or "quality" of something mentioned earlier in the input string; "D", for "determination"; and "U", for "use". Other string indexing systems employ role codes which are more clearly separate from the terms and which help give instructions to the software.
MULTITERM terms, with their role codes where appropriate, are separated by slashes ("/"); a double slash ("//") marks the end of the input string. For instance, the MULTITERM input string for a document on "the study of the structure of graphite fibers using an X-ray method" is
Fiber:Graphite -Q/Structure -D/Test Method -U/X-Ray -U//
The MULTITERM software produces the index strings by simply cycling the elements marked off by the slashes:
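A minimal sketch of this kind of cycling is given below. It assumes, for illustration only, that the cycled elements are rejoined with single slashes and that the role codes simply travel with their terms; the actual MULTITERM display conventions may differ.

```python
def multiterm_cycle(input_string):
    """Cycle the slash-delimited elements of a MULTITERM-style input string."""
    body = input_string.strip()
    if body.endswith("//"):
        body = body[:-2]  # the double slash only marks the end of the string
    elements = [e.strip() for e in body.split("/") if e.strip()]
    # One rotation per element, each element taking the lead position in turn.
    return ["/".join(elements[i:] + elements[:i]) for i in range(len(elements))]


sample = "Fiber:Graphite -Q/Structure -D/Test Method -U/X-Ray -U//"
cycled = multiterm_cycle(sample)
for s in cycled:
    print(s)
```

The sample input string has four elements and therefore yields four rotations.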
Rumen .. contents .. affected - by hay /f. alfalfa - in cattle
Given that each dash in the input string identifies an access term, index string generation in Statement Indexing is basically KWOC-like. Two major differences should be noted, however. These differences can be illustrated by the index strings corresponding to the sample input string above:
Access via terms introduced by slashes in the input string is provided by cross-references; e.g.,
Alfalfa - hay see Hay/f. alfalfa
Many library catalog systems apply some version of the MARC (MAchine-Readable Cataloging) standard to their input strings. The indexer, or cataloger, does not have to construct, or even see, the input string in its raw MARC format, which is more or less incomprehensible except to experts. Nevertheless, catalogers must often know and employ many of the MARC codes in their input.
Associated with most of the larger segments, or fields, of a MARC input string are three-digit codes identifying the aspect of the indexed item being described. Examples are: "100", "main entry", usually the name of the principal author; "245", title; "250", edition; "260", publication details; "300", physical description; "504", note on bibliographies and indexes contained; "650", "topical" subject.
Smaller segments, or subfields, may be marked with two-character codes such as "‡b", "‡c", and "‡2", whose meaning varies depending on where they occur (the first character, variously displayed in practice, is represented here by the double dagger, "‡"). Other codes appear in fixed positions near the beginning of the input string or the beginnings of fields. For example, "10" at the beginning of the "100" field signifies "single surname, not a subject"; "14" at the beginning of the "245" field, "access term, first 4 characters disregarded in sorting"; "0" at the beginning of the "260" field, "publisher or the like named"; " 0" at the beginning of a "650" field, "not specified as primary or secondary, Library of Congress subject heading".
Much of the information in the first part of a MARC input string is both heavily coded and not necessary for the production of index strings. Thus, the following sample display (OCLC 1984, p. Intro:2) is only of information from the later part:
100 10 Tuchman, Barbara Wertheim.
245 14 The march of folly : ‡b from Troy to Vietnam / ‡c Barbara W. Tuchman
250 1st ed.
260 0 New York : ‡b Knopf : Distributed by Random House, ‡c 1984.
300 xiv, 447 p., [32] p. of plates : ‡b ill. (some col.) ; ‡c 24 cm.
504 Includes bibliographies and index.
650 0 History, Modern
650 0 History ‡x Errors, inventions, etc.
650 0 Power (Social sciences)
650 0 Judgment.
MARC is not in itself a string indexing system; it is not even a complete system for input strings. Catalogers also consult: 1. an extensive set of "descriptive" cataloging rules on how aspects of the indexed item other than its subjects are to be described; 2. a list of subject headings; 3. at least one library classification scheme. The second edition of the Anglo-American Cataloging Rules is the common present-day standard for "descriptive" cataloging. For the other aspects of the description, which relate mainly to subject matter, the Library of Congress Subject Headings and either the Dewey Decimal Classification or the Library of Congress Classification are widely used.
After these extensive instructions for input strings, index string generation is relatively simple. The index string generation process is similar to that for Statement Indexing, which in turn was probably influenced by earlier manual library catalog practice. The main difference is in the MARC codes, which all drop out, with a small amount of formatting and numbering taking their place. For example, typical index strings for the sample input string partly displayed above are:
Though not nearly so complex as MARC, the input strings in the PRECIS system still tend to seem rather daunting to the non-initiate because of the number of coding characters used. Each segment of a PRECIS input string is introduced by a nine-character code beginning with a dollar sign ("$") plus "x", "y", or "z"; a two- or three-character code also beginning with a dollar sign introduces a connective or a term after the first in a segment. The input string is easier to grasp if each segment is written on a separate line, with the positions in the nine-character codes lined up in columns. The third column contains role codes which help to define links between terms; for example, "1" for the object of an action, "2" for an action or process, "3" for agents or factors, "p" for parts and properties, or "s" for certain types of relationships. The fourth column contains a "1" if an access term follows; otherwise, a "0". A fairly simple example is
$z11030$adocument surrogates
$zp1030$ainformation content$wof
$z20030$ameasurement$wof
$zs0030$aapplications$vof$win
$z31030$ainformation theory
for an article on "applying information theory to measuring the information in document surrogates [such as abstracts]". This description can be viewed as corresponding to a structure somewhat like
APPLICATIONS---of (s-agent)---*INFORMATION THEORY
|
in (s-object)
|
MEASUREMENT
|
of (2-object)
|
*INFORMATION CONTENT---of (whole)---*DOCUMENT SURROGATES
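The positional codes of the sample PRECIS input string can be read off mechanically. The sketch below decodes only what has been described here, namely the role code in the third column and the access indicator in the fourth; the remaining positions and the "$v" and "$w" codes are carried along uninterpreted. The function name and the dictionary layout are illustrative assumptions.

```python
def parse_precis_segment(segment):
    """Split one PRECIS-style segment into role code, access flag, and text."""
    code, rest = segment[:9], segment[9:]   # nine-character introductory code
    parts = rest.split("$")
    return {
        "role": code[2],                    # third column: role code
        "access": code[3] == "1",           # fourth column: "1" marks an access term
        "term": parts[0],
        # Later two-character codes (e.g. "$w") are kept as (letter, text) pairs.
        "extras": [(p[0], p[1:]) for p in parts[1:] if p],
    }


sample = [
    "$z11030$adocument surrogates",
    "$zp1030$ainformation content$wof",
    "$z20030$ameasurement$wof",
    "$zs0030$aapplications$vof$win",
    "$z31030$ainformation theory",
]
segments = [parse_precis_segment(s) for s in sample]
access_terms = [s["term"] for s in segments if s["access"]]
print(access_terms)
```

The three terms flagged in this way are the same three marked with asterisks in the structure diagram.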
A PRECIS index string has three basic parts. The first two, the "lead" and the "qualifier", together form the heading; the lead is in boldface and is separated from the qualifier by a period-plus-space. The third part is a subheading, called the "display". This general pattern may be represented as
Lead. Qualifier
   Display
The lead must contain at least one term, but the qualifier or the display may be empty.
The PRECIS index string generation rules are quite complex and cannot be given here in detail. The most fundamental procedure, however, is to: 1. make the access term into the lead; 2. put the terms which precede it in the input string into the qualifier in reverse order; and 3. put the terms following it in the input string into the "display" in their original order. The first two index strings produced from the sample input string show the results of this fundamental procedure, known as "shunting"; the third shows the result of a procedure known as the "predicate transformation":
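The fundamental shunting procedure, taken by itself, is easily sketched. The following deliberately ignores connectives, the predicate transformation, and every other refinement of the real PRECIS rules; the function name and the bare list of terms are illustrative assumptions.

```python
def shunt(terms, access_index):
    """Return (lead, qualifier, display) for the term at access_index."""
    lead = terms[access_index]
    qualifier = list(reversed(terms[:access_index]))  # earlier terms, reversed
    display = terms[access_index + 1:]                # later terms, original order
    return lead, qualifier, display


terms = [
    "document surrogates",
    "information content",
    "measurement",
    "applications",
    "information theory",
]
lead, qualifier, display = shunt(terms, 1)
print(lead, qualifier, display)
```

As each successive access term becomes the lead, the terms before it are "shunted" into the qualifier, so that the qualifier grows while the display shrinks.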
PHARMACOLOGY, CHEMICAL>DRUG>ANTIBIOTIC; STIMULATION-CIRCULATORY SYSTEM>HEART: STUDY-ANIMAL>RABBIT
The rule for index string generation in the theoretical "basic" version of POPSI is an extremely simple KWOC-like one. Early implementations, however, use cycling (Bhattacharyya and Neelameghan 1969; Ravichandra Rao 1973). The most recent reported POPSI index string generator is fairly KWOC-like; but additional qualifying terms are inserted after the lead term, and generic terms are dropped in the subheading (Mahapatra 1978; Neelameghan and Gopinath 1975; Ravichandra Rao 1976). For example, index strings produced from the sample input string above are:
In a related system, described by Gupta (Gupta 1970), input strings consist of classification codes in the Colon Classification. The software analyzes each classification code and looks up its parts in machine-readable classification schedules. The results are strung together to form a description similar to a POPSI input string; this description is then cycled to produce a set of index strings. The index entries use the original classification codes as the locators. This system seems to epitomize string indexing's advantage of much index from little input; at the same time, heavy reliance is placed on the quality of the classification scheme, the indexer's knowledge of it, and its availability in machine-readable form.
21 manufacture of calories low // potato products
22 potato products use for calories low // snacks
23 " " " " " " // fish products
32 Belgium
41 c
51 21 Calories
52 22 Potatoes
53 23 "
71 21 Potatoes
72 22 //
73 23 //
refers to a "Belgian patent for a method of manufacturing low-calorie potato products, with mention of their possible use in low-calorie snacks and fish products". A number of rules govern the forms taken by the phrases in the first part of a CASIN input string; the somewhat odd expression "calories low" above is the result of the application of one of these rules.
Each CASIN index string consists of a heading and a subheading. Index string generation begins with one of the categories with a code greater than 50. If there is a term for this category, the index string generator places this term in the heading. If ditto marks are found instead, the heading term is taken from the previous category. If the special symbol ("//") is found, the index string generator uses the main word of the phrase belonging to the linked category. To form the first part of the subheading, the index string generator takes the phrase belonging to the linked category. It inverts this phrase if the heading category code begins with "5"; otherwise, the order of the input string is retained. Later parts of the subheading are taken from categories with codes in the 30's and 40's. The index strings from the sample input string are thus:
Wood, NW 0105 Abstracts 0205 (and their) indexes 0305 Aslib Proc 0405 18/1966/160-166 0505
KWIDR index strings are formed basically by cycling, but with the following variations: 1. the four-digit code for the lead term is inserted after the lead term; 2. the locator is inserted after the third term of the index string; 3. the index string terminates after four terms, except that the first term of the input string must be appended if it has not yet been included. The index entries for the sample input string, where the locator is "R1", are thus:
KWIDR is a somewhat mysterious system and the reasons for some of its features are obscure.
@MEASURES? OF <INFORMATION CONTENT? OF <DOCUMENT SURROGATES>>? FROM <INFORMATION THEORY>
The NEPHIS index string generator creates an index string by beginning with the phrase associated with an access term. If this phrase is nested, there is one phrase in which it is most immediately nested. The index string generator basically appends to the index string the rest of this immediately larger phrase; a period plus space is inserted first unless the input string supplies a connective for this purpose. This appending process is repeated until the whole input string has been dealt with. Some connectives and all coding symbols are omitted from the index string. Thus, from the input string above, the NEPHIS index string generator generates the index strings:
The second system, LIPHIS (LInked PHrase Indexing System) (Craven 1978), was developed primarily to allow descriptions involving more complex structures of terms and links between terms; it also avoids the many brackets required in longer NEPHIS input strings. LIPHIS is characterized particularly by two common features of its input strings: 1. numeric codes to indicate places where the index string generator is to see links between terms which are separated in the input string; 2. the equal sign ("=") to indicate breaks in the sequence of links that the index string generator normally assumes between successive terms in the input string. Of the other coding symbols, the at sign ("@") is used, as in NEPHIS, before a non-access term, while the exclamation mark ("!") is used between words in multi-word terms. In addition, the index string generator recognizes an initial upper-case letter as indicating a term and an initial lower-case letter as indicating a connective. A LIPHIS input string for the item on "measures, from information theory, of the information content of document surrogates" is
@Measures 1 of Information!Content of Document!Surrogates = 1 from Information!Theory
LIPHIS divides an index string into a heading, consisting of the lead term, and a subheading. In creating an index string, the LIPHIS index string generator behaves generally quite like the hypothetical index string generator described in Chapter 1: from an initial term, it first follows as many links as possible forward, then follows a link backward, and repeats the process until there are no more links to follow. Thus, given that the sample LIPHIS input string represents the structure
MEASURES---from---*INFORMATION THEORY
|
of
|
*INFORMATION CONTENT
|
of
|
*DOCUMENT SURROGATES
it is not difficult to foresee that the index strings take the forms:
The third system, NETPAD (Craven 1982d, 1984; Declerck and Craven 1983), originated in work with graphical displays of the networks defined in Farradane's Relational Indexing, discussed below. There are two main versions of NETPAD, one written in MAXBASIC for the DECsystem-10 and the other in Commodore BASIC for the PET2001-8. The latter version is the more advanced and will be the one generally referred to in this book. A NETPAD input string has two parts, a table of terms and a table of links. Each term in the table of terms is numbered, and each element in the table of links consists of two term numbers and the number of the type of link.
Like MARC input strings, those for NETPAD cannot be displayed literally; they can, however, be reformatted by the NETPAD software into displays readable by human beings. In the display format most similar to the input string, each term in the term list appears, preceded by its number, on a separate line. The linktype table is presented in three columns with the center column containing mnemonic characters for the linktypes. The actual mnemonics used, as well as the linktypes themselves, are the choice of the user. For instance, an input string for the indexed item on "measures, from information theory, of the information content of document surrogates" can be presented as:
#  Term
1  MEASURES
2  INFORMATION CONTENT
3  DOCUMENT SURROGATES
4  INFORMATION THEORY
#  Linktype  #
1  (         2
2  (         3
1  _         4
Here, the user has chosen the left parenthesis ("(") as a mnemonic symbol for an "of" type of link and the underscore ("_") for a "from" type of link. The more usual display format presents the structure of terms and term links graphically, using the same mnemonic symbols; e.g.,
MEASURES
|/(
| INFORMATION CONTENT
| /(
| DOCUMENT SURROGATES
|
/_------INFORMATION THEORY
The basic NETPAD index string generation process is again similar to that of the hypothetical index string generator described in Chapter 1. In an important difference, however, the index string generator considers a link too "weak" to follow if the weight of the associated linktype falls below a "cutoff threshold". The threshold and the linktype weights are controllable by the user. The purpose of this sort of user control is specifically to allow customizing of index displays to suit specific search needs, especially in online systems. NETPAD users can also control what connectives represent what links in index strings.
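The effect of the cutoff threshold can be illustrated with a small sketch that works on a table of terms and a table of links like those of the sample input string. The weights, the thresholds, and the function name are hypothetical, and the sketch shows only which terms remain reachable from a lead term; it does not attempt to reproduce the order in which NETPAD follows links or the connectives it inserts.

```python
terms = {
    1: "MEASURES",
    2: "INFORMATION CONTENT",
    3: "DOCUMENT SURROGATES",
    4: "INFORMATION THEORY",
}
links = [(1, "(", 2), (2, "(", 3), (1, "_", 4)]   # (term #, linktype, term #)
weights = {"(": 0.8, "_": 0.3}                    # hypothetical user-set weights


def reachable_terms(start, links, weights, threshold):
    """Collect the terms reachable from start over links that are not too weak."""
    # A link is too weak to follow if its linktype weight falls below the threshold.
    strong = [(a, b) for a, kind, b in links if weights[kind] >= threshold]
    seen, stack = {start}, [start]
    while stack:
        current = stack.pop()
        for a, b in strong:
            # Links may be followed forward or backward.
            for near, far in ((a, b), (b, a)):
                if near == current and far not in seen:
                    seen.add(far)
                    stack.append(far)
    return seen


print(sorted(terms[n] for n in reachable_terms(1, links, weights, 0.5)))
print(sorted(terms[n] for n in reachable_terms(1, links, weights, 0.2)))
```

Raising the threshold above the weight assigned to the "from" linktype drops INFORMATION THEORY from any index string led by MEASURES; lowering it brings the term back.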
Since it is somewhat misleading to try to illustrate NETPAD output with a single set of index strings, two sets, both derived from the input string displayed above, will be used instead:
An input string for Relational Indexing in general is divided into lines, and elements within each line are separated by semicolons (";"); each element begins with a code consisting of a letter followed by an equal sign ("="). As it is for NETPAD, the basic input string is a table of terms and a table of links. The code "s=" precedes each term in the term list; the code "w=" precedes each term number and "r=" each linktype number in the link table. Examples of linktype numbers are: 3 ("distinctness", "/)"), usually meaning "having as substitute"; 6 ("action", "/-"), meaning "affected by"; 7 ("association", "/;"), indicating various types of relationship; and 9 ("functional dependence", "/:"), meaning "yielding".
For the specific purposes of the index string generator, information added in the input string may include: whether terms represent processes ("v=2") or other entities ("v=1"); with which pairs of terms to begin index strings ("l=1"); how to express links in special cases ("p=" plus a connective); and special additional links ("a=" plus a special linktype number and a term number; "g=" plus a link number).
Using the Relational Indexing system to describe an item on "the use of measures from information theory and coding theory in the measurement of the semantic information content of documents and their surrogates", an indexer creates the input string
v=1;s=documents
v=1;s=surrogates
v=1;s=information content/semantic
v=2;s=measuring
v=1;s=measures
v=1;s=information theory
v=1;s=coding theory
l=1;w=1;r=3;p=of;l=1;w=2
l=1;w=1;a=12;r=7;l=1;w=3;g=4
l=1;w=2;a=11;r=7;w=3
g=2;w=3;r=6;w=4
w=4;r=7;p=for;w=5
l=1;w=6;a=17;p=derived;r=9;p=derived from;w=5
l=1;w=7;a=16;p=derived;r=9;p=derived from;w=5
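The element syntax of such an input string is regular enough to be split mechanically. The sketch below only separates each line into (code, value) pairs and collects the term list; it makes no attempt to interpret the link table. The function name is an illustrative assumption.

```python
def parse_relational_line(line):
    """Split one input line into (code, value) pairs, preserving their order."""
    elements = []
    for element in line.split(";"):
        element = element.strip()
        if not element:
            continue
        code, _, value = element.partition("=")  # letter code, then its value
        elements.append((code, value))
    return elements


sample = """v=1;s=documents
v=1;s=surrogates
v=1;s=information content/semantic
v=2;s=measuring
v=1;s=measures
v=1;s=information theory
v=1;s=coding theory
l=1;w=1;r=3;p=of;l=1;w=2
l=1;w=6;a=17;p=derived;r=9;p=derived from;w=5"""

parsed = [parse_relational_line(line) for line in sample.splitlines()]
# Lines carrying an "s=" element belong to the table of terms.
term_list = [value for line in parsed for code, value in line if code == "s"]
print(term_list)
```

Pairs are kept in a list rather than a dictionary because codes such as "l=", "w=", and "p=" can occur more than once on a single line of the link table.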
An index string in this system has two parts: a one-term heading and a subheading. The index string generation process basically involves following links from term to term, beginning with an access term. Special rules must be invoked where more than one sequence of links could be followed; for instance, except for the fact that the link in which the lead term is marked is always followed first, special additional links take precedence over ordinary links. The index strings from the sample input string will serve as an illustration of the results obtained:
yl1/English literature ta1/1500-1599
ra1/Shakespeare, William pa1/Tragedy
na1/Hamlet lk1/<uso*Hendiadys
ka1/<soi**Virgil ha1/*Linguistic approach
As produced for the MLA International Bibliography, a CIFT index string has three parts: a heading, to be displayed in boldface capitals; a subheading, in mixed upper-and-lower-case boldface; and a subsubheading, in typefaces of ordinary weight. The index string generation rules are KWOC-like, with some exceptions, such as: certain parts of the input string are standardized automatically by reference to a thesaurus; the codes and abbreviations are translated to read more naturally; terms preceded by certain "role" codes, such as "soi", have some indication of these codes appended when they are lead terms; a lead term repeated in the subheading is capitalized; author dates appear only in the heading. For example, the index strings from the sample input string are:
SHAKESPEARE, WILLIAM (1564-1616)
See also classified section: I 1831 ff.
GERMAN-ENGLISH # PHYSICS # DICTIONARY
The basic index string generation process in this system is that of cycling. When more than two terms are involved, however, two additional index strings are normally generated in the forms: "first term - last term - middle"; and "last term - middle - first term". Thus the input string above yields five index strings:
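The rule can be sketched as follows, assuming for illustration that terms are separated by "#" in the input string and joined by " - " in the index strings; the function name is likewise hypothetical.

```python
def cycle_with_extras(input_string, sep=" - "):
    """Cycle the terms, then add the two extra forms when there are over two."""
    terms = [t.strip() for t in input_string.split("#")]
    n = len(terms)
    # Basic cycling: each term takes the lead position in turn.
    strings = [sep.join(terms[i:] + terms[:i]) for i in range(n)]
    if n > 2:
        first, middle, last = terms[0], terms[1:-1], terms[-1]
        strings.append(sep.join([first, last] + middle))    # first - last - middle
        strings.append(sep.join([last] + middle + [first])) # last - middle - first
    return strings


result = cycle_with_extras("GERMAN-ENGLISH # PHYSICS # DICTIONARY")
for s in result:
    print(s)
```

For three terms, the three cycled forms and the two additional forms are all distinct, which accounts for the five index strings.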
The Iowa State University index string generator, like that for Double-KWIC, actually has two stages. The first stage produces one or more intermediate strings making explicit any alternatives indicated by parentheses and semicolons. For example, from the input string
(CIVIL ENGINEERING) # (TABLES; ≠ HANDBOOK)
describing a "handbook of civil engineering, with tables", the intermediate strings produced are:
Like PERMUTERM and SLIC, the Iowa State University system emphasizes number of index strings. For example, the input string
(ENERGY; COAL; PETROLEUM; NUCLEAR ENERGY; SOLAR ENERGY; GEOTHERMAL ENERGY; GASOLINE; HYDROPOWER; POWER ENGINEERING; INDUSTRIAL ENGINEERING) # (PHYSICAL CONSTANTS; ≠ HANDBOOK; DICTIONARY)
assigned to Energy technology handbook, yields no fewer than 50 index strings. On the other hand, the index strings are usually short. Thus, the original Mischo index to the Iowa State University Library Reference Collection truncates all index strings at 38 characters, but only about 5% lose any characters in this way.
213ACCIDENTS$M03SMALL$M13FOREIGN$
313AUTOMOBILES$413STATISTICS$
Unlike most string indexing systems, PERMDEX does not regularly put the locator at the end of the index entry, but inserts it between the heading and the subheading. For example, where the locator is "1234", the index entries produced from the input string above are:
Sorghum bicolor, Mutation, Hydrazine: Ethyl methanesulphonate: Gamma rays, *Effect
The PASI index string generator produces the index strings basically by cycling the segments, with three variations: 1. no term preceded by an asterisk in the input string is made a lead term; 2. when a multiterm segment is cycled into the lead position, the terms within it are cycled in turn; 3. when a segment beginning with a parenthesized preposition is cycled into the lead position, the parenthesized preposition is omitted. Thus, the index strings generated from the input string above are:
3 - WORKERS COMPENSATION INSURANCE - SELF INSURANCE - COSTS
and
5 - RESERVES - WORKERS COMPENSATION - SELF INSURANCE
The basic rule for index string generation in the NILS system is to make every possible permutation of the access terms in the input string while leaving non-access terms unmoved. Thus, the index strings resulting from the first input string above are:
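The basic rule can be sketched with the standard permutation routine. How NILS input strings actually distinguish access terms from non-access terms is not assumed here: the sketch simply takes a list of (term, is-access) pairs, and both the function name and the marking of the terms are hypothetical.

```python
from itertools import permutations


def permute_access_terms(terms):
    """terms: list of (text, is_access) pairs. Return every arrangement in which
    the access terms are permuted and the non-access terms stay where they are."""
    slots = [i for i, (_, is_access) in enumerate(terms) if is_access]
    access = [terms[i][0] for i in slots]
    results = []
    for ordering in permutations(access):
        arranged = [text for text, _ in terms]
        for slot, text in zip(slots, ordering):
            arranged[slot] = text   # refill only the access-term positions
        results.append(arranged)
    return results


# Hypothetical marking: all three terms treated as access terms.
all_access = permute_access_terms([
    ("WORKERS COMPENSATION INSURANCE", True),
    ("SELF INSURANCE", True),
    ("COSTS", True),
])
# Hypothetical marking: the last term treated as a non-access term.
one_fixed = permute_access_terms([
    ("WORKERS COMPENSATION INSURANCE", True),
    ("SELF INSURANCE", True),
    ("COSTS", False),
])
print(len(all_access), len(one_fixed))
```

The number of index strings grows as the factorial of the number of access terms, so that input strings with many access terms quickly become expensive.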
PEACE, rest, quiet
logic: epistemic 112
has a slightly different locator from its complement
epistemic logic 112f
Similarly, the index entries
modal predicate logic 110
and
predicate logic: modal 110
are not properly complemented by the index entries starting with "logic"
logic: modal 108f
and
logic: predicate 58f, 148f
which give no indication to a searcher that information on a kind of predicate logic is available on page 110.
Multiple overlapping entries are often possible in using lists of subject headings such as the Library of Congress Subject Headings. Again, their generation is not by computer software. Indexers may, however, sometimes be guided to generate multiple overlapping entries according to explicit rules. For example, in the Library of Congress system, an item assigned a subject heading in the form "Country 1 - Relations (Military) with Country 2" must also be assigned one in the form "Country 2 - Relations (Military) with Country 1"; e.g.,
The first part of a Systematic Indexing index entry, the "statement", consists usually of two to three terms. The second part, the "amplification", can be quite long, containing a sort of abstract, or "extension", as well as: date of information; author(s); name of publication, place and date, pagination, edition; and a locator in the form of a call number. A sample "statement" and "extension", for an item on "60-80% increase of paper prices in India due to scarcity", are
(statement) PAPER
INDIA
DEMAND
(extension) Prices have advanced 60-80% owing to scarcity.
A statement must contain one "process" term; in addition, it must contain either a "concrete" term or one or two "country" (geographical location) terms, or both. Concrete terms and country terms are always access terms; process terms never are. For example, an item on "the export of electric traction motors from Italy to France" is assigned: the process term "EXPORT TRADE"; the concrete term "ELECTRIC TRACTION MOTOR"; and the country terms "ITALY" and "FRANCE". The statements in the corresponding entries are:
T                 education
T9(Y31)           rural education
T9(Y31).4         rural education in Asia
T9(Y31).44        rural education in India
T9(Y31).44'N3     rural education in India in the 1930's
T9(Y31).44'N3s    statistics on rural education in India in the 1930's.
The compiler of a chain index sees to it that there are index entries for various broader and narrower classes to which the classification scheme assigns each item in the sequence. A chain index avoids waste by starting the index entry for each class in a chain with a different term; for this reason, some classes in the chain may in fact have no index entries. For example, a chain index to a sequence classified by the Colon Classification would contain the following entries for the sample item above:
The most common procedure in chain index entry generation, exemplified by all but the first of the index entries above, is quite similar to that for TABLEDEX: starting with an ordered list of terms and successively removing terms from the beginning until no terms remain. Chain indexing nevertheless shows two important differences: 1. the terms are ordered, not alphabetically, but in an order corresponding to the reverse of the chain; 2. the locator is changed with each term dropped from the term list.
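The common procedure just described can be sketched in Python. This is a minimal illustration rather than Ranganathan's full procedure: the function name `chain_entries`, the single term chosen for each link of the chain, and the (terms, locator) output format are assumptions made for this sketch. It also ignores the exceptional first entry and the possibility that some classes receive no entry at all.

```python
# Hypothetical chain for the sample item, broadest class first.
# The one-word term attached to each class number is an assumption.
SAMPLE_CHAIN = [
    ("T", "Education"),
    ("T9(Y31)", "Rural"),
    ("T9(Y31).4", "Asia"),
    ("T9(Y31).44", "India"),
    ("T9(Y31).44'N3", "1930's"),
    ("T9(Y31).44'N3s", "Statistics"),
]


def chain_entries(chain):
    """Return (terms, locator) pairs for a chain of (class_number, term)."""
    # Difference 1: terms are ordered in the reverse of the chain,
    # i.e. narrowest class first.
    reversed_chain = list(reversed(chain))
    terms = [term for _, term in reversed_chain]
    locators = [number for number, _ in reversed_chain]

    entries = []
    # Successively remove terms from the beginning until none remain.
    for start in range(len(terms)):
        # Difference 2: the locator changes with each term dropped; it is
        # the class number belonging to the current lead term.
        entries.append((terms[start:], locators[start]))
    return entries


for terms, locator in chain_entries(SAMPLE_CHAIN):
    print(", ".join(terms), " ", locator)
```

Each entry leads with a different term, so no two entries compete for the same place in the alphabetical sequence, and each points to a class rather than to a document.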
A newly classified document may result in several new index entries in a traditional chain index; these index entries do not, however, point directly to the document itself, but each to a different class of documents. Moreover, entry generation is not an automated process. Thus, traditional chain indexing is also not a form of string indexing according to the definition given in this book. Its importance here lies in its influence on the design of POPSI and PRECIS, both directly and through Coates' adaptation.
Coates' chain procedure (Coates 1960, 1969; Coates and Nicholson 1967), adopted for British Technology Index, the predecessor of Current Technology Index, is based on that of Ranganathan. But Coates modified Ranganathan's method for use without a classification scheme or a classified sequence. Coates' work is especially important because of its explicit rules and because of its computerization. As in string indexing, the indexer constructs input strings which the software uses to construct the index. Unlike string indexing, only one index entry normally results from each input string; however, one or more cross-references are also produced. For example, for an item on "the turbulent flow of water in pipes", the description part of the single index entry is
WATER:Flow,Turbulent:Pipes
From the same input string, however, the software also produces the cross-references
The TOPSI-UNIV indexer first types in every word or number that is to appear in any index entry for the indexed item. The TOPSI-UNIV program assigns a one-letter code to each word or number typed in. The indexer then types in a specification for each index string to be produced for the indexed item. Each specification is identical with the corresponding index string except that: 1. each word or number is represented by its one-letter code; 2. the beginning of a subheading is indicated by a plus sign ("+"). Finally, the indexer types "GO" to indicate the end of the input string.
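The expansion of specifications into index strings can be sketched as follows. This is a hedged illustration, not the TOPSI-UNIV program itself: the function names are hypothetical, codes are assumed to be assigned alphabetically in typing order, adjacent code letters are assumed to expand to words separated by a space, and the (heading, subheading) pair is an assumed output format. The phrase and specifications are those of the PRECIS-like example that follows.

```python
from string import ascii_uppercase


def assign_codes(phrase):
    """Give each typed word a one-letter code, in typing order (assumed)."""
    words = phrase.split()
    if len(words) > len(ascii_uppercase):
        raise ValueError("this sketch supports at most 26 words")
    return dict(zip(ascii_uppercase, words))


def expand(spec, codes):
    """Replace each code letter in a specification by its word."""
    pieces = []
    previous_was_word = False
    for ch in spec:
        if ch in codes:
            # Adjacent code letters become words separated by a space.
            if previous_was_word:
                pieces.append(" ")
            pieces.append(codes[ch])
            previous_was_word = True
        else:
            # Punctuation and spaces are copied through unchanged.
            pieces.append(ch)
            previous_was_word = False
    return "".join(pieces)


def index_strings(phrase, spec_lines):
    """Expand specifications until the terminating "GO" line is reached."""
    codes = assign_codes(phrase)
    results = []
    for line in spec_lines:
        if line.strip() == "GO":
            break
        # A plus sign marks the beginning of the subheading.
        heading, _, subheading = line.partition("+")
        results.append((expand(heading, codes), expand(subheading, codes)))
    return results


PHRASE = ("APPLICATIONS OF INFORMATION THEORY TO MEASUREMENT OF "
          "INFORMATION CONTENT OF DOCUMENT SURROGATES")
SPECS = ["KL+HI. F. ABCD", "HI. KL+F. ABCD", "CD+AEFGHIJKL", "GO"]

for heading, subheading in index_strings(PHRASE, SPECS):
    print(heading)
    print("    " + subheading)
```

Because the specification is a literal template of the index string, the indexer rather than the software decides term order and punctuation; the program only substitutes words for codes.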
TOPSI-UNIV can imitate the index strings of almost any string indexing system. For example, an indexer wishing to obtain PRECIS-like index strings for the item on "applying information theory to measuring the information in document surrogates" could first type
APPLICATIONS OF INFORMATION THEORY TO MEASUREMENT OF INFORMATION CONTENT OF DOCUMENT SURROGATES
The TOPSI-UNIV program would code this phrase as
AAPPLICATIONS BOF CINFORMATION DTHEORY EIN FMEASUREMENT GOF HINFORMATION ICONTENT JOF KDOCUMENT LSURROGATES
and the indexer could then enter the specifications
KL+HI. F. ABCD
HI. KL+F. ABCD
CD+AEFGHIJKL
GO
The resulting index strings would be:
String indexing systems with ordinary-language input strings include cycling and KWIC systems, KWOC systems, PANDEX, PERMUTERM, Double-KWIC, ASI, and KWPSI. A cycled index string consists of a lead term, the part of the input string following the lead term, a dividing symbol, and the part of the input string preceding the lead term; KWIC is a common way of displaying the output of the cycling process. A KWOC index string consists of a lead term plus the unmodified input string; in a variant, an omission symbol is substituted for the repeated lead term. PANDEX is an elaboration of KWOC. In PERMUTERM, each index string consists of two terms, and a number of index strings from one input string may share the same lead term. The latter is also true of Double-KWIC, which is a combination of cycling and variant KWOC. ASI and KWPSI represent somewhat more complex manipulation of input strings, involving segmenting at different levels.
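The two simplest generators summarized above, cycling and KWOC, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the use of "/" as the dividing symbol, and the one-word stoplist are illustrative choices, and lead terms are taken to be single words not on the stoplist.

```python
def cycled_index_strings(title, stoplist, divider="/"):
    """Lead term, what follows it, a dividing symbol, what precedes it."""
    words = title.split()
    strings = []
    for i, word in enumerate(words):
        if word in stoplist:
            continue  # stoplist terms cannot be access terms
        following = words[i + 1:]
        preceding = words[:i]
        strings.append(" ".join([word] + following + [divider] + preceding))
    return strings


def kwoc_index_strings(title, stoplist):
    """Lead term plus the unmodified input string."""
    return [(word, title) for word in title.split() if word not in stoplist]


TITLE = "EXTRUSION TEXTURIZING OF SOY MEAL"
STOPLIST = {"OF"}  # assumed stoplist for this example

for s in cycled_index_strings(TITLE, STOPLIST):
    print(s)
for lead, rest in kwoc_index_strings(TITLE, STOPLIST):
    print(lead, "|", rest)
```

With this stoplist the sample title yields four strings from each generator, one per keyword. When nothing follows or precedes the lead term, the corresponding part of the cycled string is simply empty.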
String indexing systems with term-list input strings include the CLASE system, ABC-Spindex, TABLEDEX, SLIC, and, somewhat marginally, MULTITERM. The CLASE system is much like PERMUTERM. ABC-Spindex, TABLEDEX, and SLIC all alphabetize the input list of terms; index string generation is KWOC-like in ABC-Spindex and involves selective term omission in TABLEDEX and SLIC. MULTITERM appends role codes to terms and has a cycling type of index string generator.
String indexing systems with coded input strings include Statement Indexing, automated library catalog display systems, PRECIS, POPSI, CASIN, KWIDR, NEPHIS, LIPHIS, NETPAD, Relational Indexing, CIFT, the Iowa State University system, PERMDEX, PASI, and the NILS system. Statement Indexing is an early proposal involving a somewhat KWOC-like index string generator. Automated library catalog display systems are included here for completeness but cannot be discussed in detail. The well-known PRECIS system requires a nine-character code for each segment of the input string; several procedures are applied in index string generation, of which the chief is shunting. POPSI employs a simpler form of input string coding; the main procedure in index string generation is KWOC-like, but with omissions and with one or two terms inserted after the lead term. CASIN specifies access terms separately from other information in the input string; an access term is associated with a phrase earlier in the input string that will be used to generate the beginning of the subheading. KWIDR is a rather mysterious, though simple, system in which coding is added to ready-made data. NEPHIS makes use of the idea of the nesting of phrases in the input string, producing index strings similar to those from ASI, but with fewer connectives. LIPHIS, NETPAD, and Relational Indexing represent moves toward more explicit representation of networks of term links. CIFT is specially designed for documents on language, literature, and folklore; index string generation is KWOC-like. The Iowa State University system emphasizes the production of many relatively short index strings. PERMDEX, PASI, and the NILS system are all quite simple: PERMDEX is derived from PRECIS; PASI uses cycling rather than shunting.
A few indexing methods very close to string indexing are manual title catchword systems, various forms of manual cross-indexing, Kaiser's Systematic Indexing, unit card systems, the Universal Decimal Classification, chain procedures, and the Universal Index Entry Generator. Manual systems often show various degrees of inconsistency. Kaiser's system, on the other hand, is a string indexing system in all but computerization. Unit card systems are the ancestors of automated library catalog display systems. The Universal Decimal Classification and the Universal Index Entry Generator are pieces of string indexing systems rather than wholes. The chain procedures of Ranganathan and Coates are important influences on the later development of string indexing systems.