
CHAPTER 3
INPUT

This chapter will discuss some aspects of input in the various string indexing systems. Input includes not only input strings but also other, ancillary, kinds of input, such as stoplists, thesauri, and linktype tables. Since the input strings themselves are the main input, however, they will be dealt with in the most detail.

In order to produce good index strings, an index string generator must be designed to be affected by, or recognize, certain features of input strings. Sometimes, as in a highly-coded system such as PRECIS, these features are designed into the input strings to produce desired results; sometimes, as in a system in which the input strings are already-assigned descriptions such as titles, the software has to make use of existing features of data originally supplied for another purpose.

The kinds of input string features recognized by an index string generator may be divided broadly into those relating to terms and those relating to links between terms. Some recognition of term-related features is universal in index string production. Link recognition is not universal, but may be very useful in promoting good index string qualities such as eliminability and collocation.

3.1 TERMS

The general axiom that a term is the name of a thing or class of things related in a significant way to the indexed item still leaves some specific questions to be answered. What forms should terms take when more than one name is possible? How are term boundaries to be recognized by the index string generator? How are access terms to be distinguished from non-access terms?

3.1.1 Form of terms

General criteria for choosing terms in any indexing for human use include: unambiguousness; familiarity to searchers; brevity; and relevance in distinguishing and grouping indexed items. For index displays, one can add: contribution to collocation; and predictability of filing. In string indexing, terms may also be chosen partly on how well they fit a variety of positions in index strings.

Ambiguous terms may be allowed in string indexing. This is obviously true of systems using uncontrolled ordinary-language input such as titles; but even PRECIS allows the term "docks", for example, to refer either to weeds or to places where ships may be found. The presence of other terms in the input string is relied upon to resolve the searcher's uncertainty about which meaning is intended. The hope is that, while an individual term sometimes lacks clarity, the complete index strings do not, and that eliminability is not too much degraded. Some systems do require terms to be unambiguous; for the purpose of disambiguation, parenthetical qualifiers may be allowed as parts of terms such as "FILE (TOOL)".

Some collocation is bound to occur anyway because of the structure of ordinary language. For example, index entries beginning with "PRESERVATION" and "PRESERVATIVES" will be close together because of their shared stem "PRESERVA". Terms may also follow rather artificial structures in order to improve collocation. Thus, MULTITERM allows terms such as "CATALYST: BORON OXIDE", "POLYMER: ACRYLATE", and "STABILITY, BURNING". An extreme example of terms artificially structured to improve collocation is provided by the codes in a hierarchical classification scheme.

The extent to which collocation is a function of terms, rather than of the structure of the index string, depends partly on what is considered a term. Most systems allow some terms consisting of more than one word; but, at some point, the parts of multiword expressions become sufficiently independent that they must be considered separate terms and connectives. Where this point occurs depends on several factors: the string indexing system's own peculiarities; whether one part of a multiword expression indicates that another is meant in a metaphorical or unusual sense; whether searchers might want access via some later part; what kinds of cross-references are used; whether the parts can be converted readily into terms and connectives without distorting the meaning; and how familiar searchers are likely to be with the expression. The dividing line between multiword terms and multiterm expressions, in addition to coming at different places, also has different degrees of fuzziness depending on the nature of the string indexing system; in a system like Relational Indexing the line is quite clear, while it is less so in one like NEPHIS.

3.1.2 Term indication and recognition

In most string indexing systems, the input string must indicate, and the index string generator must be able to recognize, where a term begins and ends.

In a simple title-based system, the index string generator may recognize any combination of blank spaces or punctuation as marking a term boundary. Terms would normally be assumed to be single words. When an index designer using such an index string generator desires multiword terms, indexers may be instructed to substitute other characters for any spaces within a term. For example, indexers preparing input strings for the early KWIC index to Biological Abstracts join the words of a multiword term with hyphens. The index string generator may also be designed to recognize a character not normally used within a term as separating terms. Thus, the Biological Abstracts indexers mark off parts of words with slashes, and these word parts are recognized as terms by the software (Parkins 1963).
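
The boundary conventions just described can be illustrated with a minimal sketch. It assumes that blanks and ordinary punctuation end a term, that a slash separates word parts, and that a hyphen holds a multiword term together; the function name `split_terms` and the sample title are hypothetical and are not taken from any of the systems cited.

```python
import re

def split_terms(title: str) -> list[str]:
    """Split an input string into terms.

    Assumed conventions (illustrative only): blanks and ordinary punctuation
    end a term, a slash separates word parts that count as terms, and a
    hyphen keeps the words of a multiword term together.
    """
    # Hyphens are deliberately left out of the separator class.
    parts = re.split(r"[\s,.;:!?()/]+", title)
    return [p for p in parts if p]

print(split_terms("Cell-division in chloro/plasts, revisited"))
```

Because the hyphen is not a separator, "Cell-division" survives as one term, while the slash yields "chloro" and "plasts" as separate terms.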

Term boundaries may not even have to be marked by particular characters. The software may recognize multiword terms automatically, by checking against a list of common multiword terms, as is done for the PERMUTERM index to Science Citation Index (Fenichel 1971; Garfield 1976; Neufeld and others 1973). Terms within words can be recognized in a similar fashion. Chemical names in the Cambridge Crystallographic Database are automatically analyzed into constituents by use of lists of common chemical prefixes, suffixes, and derivative names and a list of element-name roots (Allen and Town 1977). Automatic analysis of chemical names has also been developed at Chemical Abstracts Service (Heym and others 1976).

3.1.3 Access terms

Index string generators recognize access terms in three main ways. The first way is by general term characteristics; for example, a KWOC program may automatically exclude any term under three characters long. The second is by comparison with a stoplist or golist. The third is from explicit codes in input strings.
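
The first two ways can be sketched together in a few lines. The stoplist entries, the function name `access_terms`, and the sample terms are assumptions made for illustration; only the three-character threshold comes from the KWOC example above.

```python
STOPLIST = {"the", "and", "for", "with"}  # illustrative entries only

def access_terms(terms, min_length=3, stoplist=STOPLIST):
    """Keep terms that pass the length test and are not on the stoplist."""
    return [
        t for t in terms
        if len(t) >= min_length and t.lower() not in stoplist
    ]

print(access_terms(["Use", "of", "computers", "for", "indexing"]))
```

A golist would simply reverse the membership test, keeping only terms that appear on the list.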

Two objections have been cited (Helbich 1969) to the first two methods, whereby either every occurrence of a given word in any description is an access term or every occurrence is a non-access term. The first objection, which applies mostly to the second method, is simply the inefficiency of the index string generator; stoplists, for example, may be extremely long and require considerable time to search. The second objection is that many words may be useful as access terms in some descriptions but not in others; for example, "program" in "computer program" but not in "sterilization program", or "division" in "cell division" but not in "employment division" (Feinberg 1972, p. 79).

The third method usually means that the indexer inserts codes next to specific terms in the input string.  Access terms may be marked, or non-access terms, or both. CIFT indexers do the first, using an asterisk ("*"). NEPHIS and PASI indexers do the second, with an "at" sign ("@") and an asterisk respectively. PRECIS input takes the third route. In PRECIS, if the first term in a segment is an access term, the value in column 4 is a "1"; otherwise, it is a "0". If a later term in a segment is an access term, it is preceded by "$2" or "$3" followed by a level code; otherwise, the level code is preceded by "$0" or "$1".

Coding sometimes indicates access terms in other ways. Thus, in the NILS system, the single numeral at the beginning of the input string identifies all the access terms. An indexer using the Relational Indexing index string generator inserts a code, not at a term in the term list, but at a term number in a link in the link list.

A CASIN indexer indicates access terms in a special part of the input string; namely, the part with "category" codes greater than 50. This separate listing in CASIN often involves repeating an access term that also appears in the earlier part of the input string. In certain cases, however, repetition is avoided; for example, by the use of the "//" code, which indicates that the access term is the main word of the phrase forming the first part of the subheading.

Access under a given word can often be suppressed simply by treating it as a connective or as a later part of a multi-word term. For example, the PRECIS input string

$z11030$ainformation retrieval systems
$z31030$ausers
$zp1030$aquestions
$z20030$aanalysis
suppresses access under "retrieval" and "systems" simply by making these words part of the term "information retrieval systems"; access under "analysis", meanwhile, is suppressed more explicitly by means of the "0" code in column 4.
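
As a rough sketch of how such coding might be read, the following treats each line as a nine-character code followed by the term, with the role operator in column 3 and the access flag in column 4, as described above. The function name `parse_precis_line` is hypothetical, and the sketch ignores everything after the leading term; it is not the actual PRECIS software.

```python
def parse_precis_line(line: str) -> dict:
    """Read the leading term of one line of a PRECIS-style input string."""
    role = line[2]                    # column 3: role operator
    is_access = line[3] == "1"        # column 4: "1" marks an access term
    term = line[9:].split("$", 1)[0]  # text after the nine-character code
    return {"role": role, "access": is_access, "term": term}

input_string = [
    "$z11030$ainformation retrieval systems",
    "$z31030$ausers",
    "$zp1030$aquestions",
    "$z20030$aanalysis",
]

parsed = [parse_precis_line(line) for line in input_string]
print([p["term"] for p in parsed if p["access"]])
```

Only whole terms are ever offered for access, so "retrieval" and "systems" never appear on their own, and "analysis" is dropped because of its "0" flag.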

3.2 LINKS

A link, or direct connection between two terms in a description, has two main aspects: 1. its type, corresponding to the relationship between the things represented by the terms; and 2. which two terms it connects. For example, suppose a description contains the three terms "DRAWING", "PRE-SCHOOL CHILDREN", and "PRIMARY-SCHOOL CHILDREN". A link here might be of an action-object ("of") type or of an action-agent ("by") type; and it might be between "DRAWING" and "PRE-SCHOOL CHILDREN" or between "DRAWING" and "PRIMARY-SCHOOL CHILDREN". Depending on the types of links and the terms connected, the description might correspond to: "drawing of pre-school children by primary-school children"; "drawing of primary-school children by pre-school children"; "drawing of pre-school children and primary-school children"; or "drawing by pre-school children and primary-school children".
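
The four readings arise from giving each of the two "children" terms one of the two link types. A small sketch, with the hypothetical function name `readings`, enumerates them; the wording of the output simply follows the four descriptions quoted above.

```python
from itertools import product

def readings(action, terms, linktypes=("of", "by")):
    """List every description obtained by giving each term one link type."""
    results = []
    for assignment in product(linktypes, repeat=len(terms)):
        groups = []
        for linktype in linktypes:
            linked = [t for t, a in zip(terms, assignment) if a == linktype]
            if linked:
                groups.append(f"{linktype} {' and '.join(linked)}")
        results.append(f"{action} {' '.join(groups)}")
    return results

for reading in readings("drawing", ["pre-school children", "primary-school children"]):
    print(reading)
```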

3.2.1 Link types

String indexing input strings may indicate the type of link between a pair of terms by several devices, including: 1. the order of the terms; 2. a connective such as a preposition or participle; 3. a code such as a "role operator" (PRECIS or CIFT), "relation" (Relational Indexing), "facet indicator" (CIFT), "category" (CASIN), or "linktype mnemonic" (NETPAD). A single link may sometimes be assigned to a type by means of several devices at once.

Assigning links between terms to types in the input string has two major purposes: to control the order of the terms in the index strings and to specify the connectives used in the index strings. Not surprisingly, term order in the input string tends to serve the first purpose and connectives in the input string the second. Codes in the input string also tend to serve the first purpose.

Examples of index string term order being determined by input term order and by codes, and of index string connectives being determined by input connectives, are easy to find. Thus, in PRECIS, the order of terms in the input string, together with the role codes in column 3, primarily controls term order; and the connectives following "$v" and "$w" appear as connectives in index strings.

The tendencies are, however, far from universal. For example, in ASI or KWPSI, a connective such as "OF" categorizes a link for both purposes; in NETPAD, a linktype code does. In NETPAD, moreover, input string term order is irrelevant to the order of the terms in the index strings. The use of codes for either purpose tends to characterize those systems, like NETPAD, with more sophisticated rules for arranging the terms in the index strings.

Each method of indicating linktypes has its own advantages and disadvantages. The use of connectives to determine connectives and term order to determine term order clearly appeals to the "what you see is what you get" principle; provided, that is, the connectives and term order in the index string are similar to those in the input string. On the other hand, if the connectives desired are fairly long, indexers may prefer to save time and avoid errors by entering codes instead; CIFT's "role" codes are one example. For more sophisticated index string generators, the single term order of the input string is often insufficient to determine the term orders of all the index strings; thus, some additional assistance from connectives or codes is required.

3.2.2 Link structures

Which terms are linked to which in an input string is important mainly as it affects the order of terms in the index strings. How much the index strings are affected will vary with the system: more sophisticated systems show greater effects.

Taken collectively, the term links represented by an input string form various types of structures. Different index string generators are designed to recognize structures of different degrees of complexity. Structures may be recognized to some extent in ordinary language, but index string generators tend to recognize more complex structures on the basis of coding composed by indexers.

Where every term is linked with at least one other term, roughly three degrees of structural complexity may be recognized in input strings: 1. linear; 2. tree; 3. network. In the linear structure, one term has a link with a second, the second a link with a third, and so on, so that a single sequence of links passes from the first term to the last through all the other terms; e.g.,

TERM1------TERM2------TERM3------TERM4
In the tree structure, a single sequence of links need not cover every term, but every pair of terms still has a single sequence of links between them; e.g.,
TERM1------TERM2------TERM3
             |
             |
           TERM4
Finally, in the network structure, no restriction is placed on how terms may be linked; thus, there may be more than one sequence of links between two terms, as in
TERM1------TERM2
  |          |
  |          |
TERM3------TERM4
All linear structures are, by definition, trees; similarly, all trees are networks. Thus, an index string generator which can recognize networks can recognize tree structures, and one which can recognize trees can recognize linear structures.
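
The three degrees of complexity can be told apart mechanically. The sketch below assumes that the links form a single connected structure with no link listed twice, and it returns the most specific label that applies; the function name `classify` is hypothetical.

```python
from collections import defaultdict

def classify(terms, links):
    """Return 'linear', 'tree', or 'network' for a connected set of links."""
    degree = defaultdict(int)
    for a, b in links:
        degree[a] += 1
        degree[b] += 1
    # A connected structure is a tree exactly when it has one link fewer
    # than it has terms; any extra link creates a second path somewhere.
    if len(links) != len(terms) - 1:
        return "network"
    # A tree is a single chain when no term carries more than two links.
    if all(degree[t] <= 2 for t in terms):
        return "linear"
    return "tree"

terms = ["TERM1", "TERM2", "TERM3", "TERM4"]
print(classify(terms, [("TERM1", "TERM2"), ("TERM2", "TERM3"), ("TERM3", "TERM4")]))
print(classify(terms, [("TERM1", "TERM2"), ("TERM2", "TERM3"), ("TERM2", "TERM4")]))
print(classify(terms, [("TERM1", "TERM2"), ("TERM1", "TERM3"),
                       ("TERM2", "TERM4"), ("TERM3", "TERM4")]))
```

The three calls correspond to the three diagrams above.
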
3.2.2.1 Linear structures
Provided the terms are so arranged in the input string that each term is linked with the terms adjoining it, linear structures present no problems either of indication or of recognition. Problems arise only when the order of terms allows unlinked terms to adjoin. Take, for example, an item on "students' research on voting". Provided the terms are arranged in the order "STUDENTS - RESEARCH - VOTING" or "VOTING - RESEARCH - STUDENTS" the link structure can remain clear. If, however, the order is changed to "RESEARCH - VOTING - STUDENTS", as in
RESEARCH on VOTING by STUDENTS
the structure is obscured; for example, the meaning may appear to be
*RESEARCH
|
on
|
*VOTING
|
by
|
*STUDENTS
in which the students are voting rather than researching. For this type of order, more general tree-structure indication and recognition methods are required.
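
Whether a given term order keeps the linear structure clear can be checked by testing that every pair of adjoining terms is directly linked. The function name `adjacent_terms_linked` is a hypothetical choice for this sketch.

```python
def adjacent_terms_linked(order, links):
    """True if every pair of neighboring terms in `order` is directly linked."""
    linked = {frozenset(link) for link in links}
    return all(frozenset(pair) in linked for pair in zip(order, order[1:]))

links = [("STUDENTS", "RESEARCH"), ("RESEARCH", "VOTING")]
print(adjacent_terms_linked(["STUDENTS", "RESEARCH", "VOTING"], links))
print(adjacent_terms_linked(["RESEARCH", "VOTING", "STUDENTS"], links))
```

The second order fails the check because "VOTING" and "STUDENTS" adjoin without being linked.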

It is difficult to point to a string indexing system which clearly exploits in its index strings linear and only linear link structures. On the other hand, several designers take linear structures as a simple base for initial development of their string indexing systems; recognition of other structures is then added onto the basic design. For example, an assumption of a linear sequence underlies the shunting procedure fundamental to PRECIS and PERMDEX, though neither system is limited to shunting or to purely linear sequences in input strings. LIPHIS and PASI likewise show a linear basis. An assumption of linear structure in most ordinary-language input strings possibly underlies cycling and KWIC and the derivative Double-KWIC.

3.2.2.2 Tree structures
As an example of a tree structure, take the network diagram representing the topic "access by the public to government information on meat testing":
*ACCESS------------------------
|                             |
to                            by
|                             |
*INFORMATION-----             *PUBLIC
|               |
of              on
|               |
*GOVERNMENT     *TESTING
                |
                of
                |
                *MEAT

In a tree structure of term links, it is useful to designate one term as the "root". A term's level is the number of links in the sequence of links connecting the term with the root; e.g.,

level 0 (root)  *ACCESS----------------
                |                     |
                to                    by
                |                     |
level 1         *INFORMATION---       *PUBLIC
                |             |
                of            on
                |             |
level 2         *GOVERNMENT   *TESTING
                              |
                              of
                              |
level 3                       *MEAT
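
Given the links and a chosen root, the level of each term follows from a breadth-first walk outward from the root. This is a minimal sketch; the function name `term_levels` is hypothetical.

```python
from collections import deque

def term_levels(root, links):
    """Map each term to the number of links separating it from the root."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    levels = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for other in neighbors.get(term, []):
            if other not in levels:
                levels[other] = levels[term] + 1
                queue.append(other)
    return levels

links = [
    ("ACCESS", "INFORMATION"), ("ACCESS", "PUBLIC"),
    ("INFORMATION", "GOVERNMENT"), ("INFORMATION", "TESTING"),
    ("TESTING", "MEAT"),
]
print(term_levels("ACCESS", links))
```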

A common way of representing a tree in computer input is to use brackets of some sort to mark a change in level: typically, a lefthand bracket for an increase in level number and a righthand bracket for a decrease. Among string indexing systems, NEPHIS, which always defines a tree in its input strings, employs angular brackets ("<", ">") to mark level changes; for example,

0ACCESS? to <1INFORMATION? of <2GOVERNMENT>? on <2TESTING? of <3MEAT>>>? by <1PUBLIC>
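
The correspondence between bracket depth and level can be checked mechanically. The sketch below reads the digit printed before each term in the example as a level annotation, which is an assumption about the example rather than a documented part of NEPHIS syntax; the function name `bracket_levels` is hypothetical.

```python
import re

def bracket_levels(input_string):
    """Pair each annotated term with the bracket depth at which it occurs."""
    depth = 0
    result = []
    for match in re.finditer(r"<|>|(\d)([A-Z]+)", input_string):
        token = match.group(0)
        if token == "<":
            depth += 1
        elif token == ">":
            depth -= 1
        else:
            annotated, term = int(match.group(1)), match.group(2)
            result.append((term, depth, annotated))
    return result

example = ("0ACCESS? to <1INFORMATION? of <2GOVERNMENT>? on "
           "<2TESTING? of <3MEAT>>>? by <1PUBLIC>")
for term, depth, annotated in bracket_levels(example):
    print(term, depth, annotated)
```

For every term, the depth computed from the brackets agrees with the printed digit.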

The chief problem with using brackets to mark level changes is familiar to anyone who has glanced at the LISP programming language: entering the correct number of righthand brackets to match the lefthand brackets can become very laborious and mistakes can be all too frequent. For example, the string

@EXPERIMENTAL USE? of <COMPUTERIZED SYSTEM? for <CONTROL? of <INFORMATION? in <OFFICES>>>? at <B-N SOFTWARE RESEARCH>
is not a legal NEPHIS input string because it is missing one righthand bracket; yet neither this fact nor where the missing bracket should be inserted may be immediately obvious.
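
A simple count is enough to detect such an error, though not to repair it. The following sketch, with the hypothetical function name `bracket_balance`, reports how many lefthand brackets are never closed and how many righthand brackets have nothing to close.

```python
def bracket_balance(input_string):
    """Return (unclosed '<' count, unmatched '>' count)."""
    depth = 0
    unmatched_right = 0
    for ch in input_string:
        if ch == "<":
            depth += 1
        elif ch == ">":
            if depth == 0:
                unmatched_right += 1
            else:
                depth -= 1
    return depth, unmatched_right

faulty = ("@EXPERIMENTAL USE? of <COMPUTERIZED SYSTEM? for <CONTROL? of "
          "<INFORMATION? in <OFFICES>>>? at <B-N SOFTWARE RESEARCH>")
print(bracket_balance(faulty))
```

The count shows that one bracket is missing but gives no indication of where it belongs; that still depends on the structure the indexer intended.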

Farradane has described another method of coding trees by means of brackets; in this case, square brackets ("[" and "]") (Farradane 1950). In Farradane's method, brackets mark, not changes in level, but beginnings and endings of branches in the tree; the last of a group of branches need not be marked at all. Matching of brackets is much less of a problem here. For example, by Farradane's method, the structure of the topic "access by the public to government information on meat testing" can be coded with only two pairs of brackets:

ACCESS to [INFORMATION [of GOVERNMENT] on TESTING of MEAT] by PUBLIC

A second approach to tree coding is to tag terms explicitly with their level numbers. Each term can then be assumed to be linked with the term with a lower level number that most immediately precedes it in the input string. Modifications of PRECIS since 1974 take this approach when dealing with adjectives (Richmond 1981). To an adjective term in an input string, the PRECIS indexer prefixes a dollar sign ("$") plus a two-character code the second character of which is the level number. For instance, the input string

     $z11030$asculptures$21stone$21German
refers to "German sculptures made of stone"; i.e.
level 0   *SCULPTURES-----
           |              |
           (made of)      (which are)
           |              |
level 1   *STONE         *GERMAN
By contrast, the input string
$z11030$asculptures$21stone$22German
with the "1" before "German" changed to a "2", means that the sculptures are made of German stone; i.e.
level 0    *SCULPTURES
            |
            (made of)
            |
level 1    *STONE
            |
            (which is)
            |
level 2    *GERMAN
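
The linking rule for level-tagged terms, that each term is linked with the nearest preceding term of lower level number, can be sketched as follows. The parsing is deliberately minimal: it assumes a nine-character leading code, gives the lead term level 0, and reads the second character after each later "$" as the level number. Both function names are hypothetical, and this is not the actual PRECIS software.

```python
def parse_levels(input_string):
    """Return (term, level) pairs from a one-line PRECIS-style string."""
    head, *rest = input_string[9:].split("$")
    return [(head, 0)] + [(part[2:], int(part[1])) for part in rest]

def link_by_level(tagged_terms):
    """Link each term to the nearest preceding term with a lower level."""
    links = []
    for i, (term, level) in enumerate(tagged_terms):
        for earlier, earlier_level in reversed(tagged_terms[:i]):
            if earlier_level < level:
                links.append((earlier, term))
                break
    return links

print(link_by_level(parse_levels("$z11030$asculptures$21stone$21German")))
print(link_by_level(parse_levels("$z11030$asculptures$21stone$22German")))
```

Changing the single level digit before "German" moves its link from "sculptures" to "stone", as in the two diagrams above.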

A third approach to tree coding can be seen in one version of POPSI input:

Leather technology 6 Leather. Light leather 6.1 Tanning. Mineral Tanning. Chrome Tanning. Two Bath Chrome Tanning - (Agent used) Dichromate 6.1.1 Evaluation
Here, a term is not necessarily preceded just by a code indicating the type of link from the preceding term; instead, a series of codes sometimes categorizes the sequence of links that connects the term with the root. For example, "6.1.1 Evaluation" specifies that the term "Evaluation" is connected with the root term "Leather technology" by a series of links categorizable as "entity + process + process".

Even without a general method for coding all trees, an index string generator may recognize at least some tree structures. Different types of codes or connectives may indicate different parts of the tree. Codes function in this way in PRECIS, POPSI, and PERMDEX; ordinary-language connectives such as prepositions, in ASI and KWPSI.

A simple PRECIS example is provided by an input string for a hypothetical item on "photography of garden fruit trees":

$z11030$agardens
$zp1030$atrees$21fruit
$z21030$aphotography
Here, the fact that "fruit" is preceded by a three-character code rather than a nine-character code shows that it is not in the main sequence of terms. Hence, the PRECIS software recognizes the structure as
*PHOTOGRAPHY
|
(of)
|
*TREES---(characterized by)---*FRUIT
|
(in)
|
*GARDENS
On the other hand, a hypothetical input string for "photography of the fruit of trees in gardens" is
$z11030$agardens
$zp1030$atrees
$zp1030$afruit
$z21030$aphotography
Here, by contrast, "fruit" is preceded by a nine-character code, and the PRECIS software recognizes the structure as
*PHOTOGRAPHY
|
(of)
|
*FRUIT
|
(of)
|
*TREES
|
(in)
|
*GARDENS

Both the ASI and the KWPSI index string generators categorize the connective "OF" differently from other prepositions such as "ON" and "BY". Both use this categorization more or less to recognize the input string

RESEARCH ON PREVENTION OF CRIMES BY SOCIOLOGISTS
as representing a structure like
RESEARCH---by---SOCIOLOGISTS
|
on
|
PREVENTION
|
of
|
CRIMES
Analysis based on ordinary-language input strings is, however, not so reliable as that of coded input strings. Thus, both the ASI and the KWPSI index string generators analyze in the same way as above the superficially similar input strings
RESEARCH ON PREVENTION OF CRIMES BY POLICE
and
RESEARCH ON PREVENTION OF CRIMES BY GANGS
where the same structure may not represent what is intended.
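
The weakness can be made concrete with a toy analyzer. The rule used here is purely hypothetical and was chosen only because it reproduces the diagram above: a term introduced by "OF" is linked to the term just before it, while a term introduced by any other preposition is linked to the first term. It is not the actual ASI or KWPSI rule, and it assumes single-word terms alternating with prepositions.

```python
PREPOSITIONS = {"OF", "ON", "BY"}  # illustrative list only

def analyze(input_string):
    """Return (term, connective, term) links under the hypothetical rule."""
    words = input_string.split()
    root = previous = words[0]
    links = []
    for connective, term in zip(words[1::2], words[2::2]):
        if connective not in PREPOSITIONS:
            raise ValueError(f"unexpected connective: {connective}")
        anchor = previous if connective == "OF" else root
        links.append((anchor, connective.lower(), term))
        previous = term
    return links

for agent in ("SOCIOLOGISTS", "POLICE", "GANGS"):
    print(analyze(f"RESEARCH ON PREVENTION OF CRIMES BY {agent}"))
```

All three strings receive the same shape, with the final term attached to "RESEARCH", whatever the indexer meant.
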
3.2.2.3 Network structures
To date, no index string generator can recognize more complex structures than trees in ordinary-language input strings. For a non-tree structure to be recognized, the indexer must always add coding to the input string. The three main approaches used are: a separate list of all links; special codes indicating additional connections in a simpler structure; and codes for specifying some common non-tree structures.

The first approach, that of a separate link list, is taken in Relational Indexing and in its derivative NETPAD. Each term in an input string is numbered, explicitly or implicitly, and each element in the link list contains a term number, a linktype code, and another term number.

An unaided indexer constructing a link list is likely to be somewhat slow and prone to error. The indexer must keep checking the list of terms to determine the appropriate term numbers when defining the links. This procedure is especially tedious if many of the descriptions are in fact linear. Take, for example, the linearly structured topic "a mathematical model for predicting the circulation of documents in a library". A Relational Indexing input string for an item on this topic is:

v=1;s=library
v=1;s=documents
v=2;s=circulation
v=2;s=predicting
v=1;s=model/ mathematical
l=1;w=1;r=7;l=1;w=2
w=2;r=6;p=of;l=1;w=3
w=3;p=*;r=6;w=4
w=4;p=by;r=3;p=of;l=1;w=5
To construct the link list of this input string, the indexer has to record the correct term number no fewer than eight times, once for each of the "w=" codes.

The second approach, that of indicating additional links in a simpler structure, is taken by LIPHIS. In LIPHIS, the simpler structure is linear.  As long as terms in the input string are linked one to the next, a LIPHIS indexer needs to indicate no additional links; e.g.,

Mathematical Model for Predicting of Circulation of Documents in Library
To specify more complex structures, the indexer marks breaks in the linear sequence with the equals sign ("=") and equivalent points with numerals. Take, for example, the input string for an item on "attitudes of students in universities to courses in those universities":
Attitudes 1 of Students 2 = 1 to Courses 2 in Universities
Here, the "=" indicates that "STUDENTS" is not linked to "COURSES" by a "to" link; the "1"'s indicate that "ATTITUDES" is linked, not only to "STUDENTS" by an "of" link, but also to "COURSES" by a "to" link; and the "2"'s indicate that "STUDENTS" and "COURSES" are both linked to "UNIVERSITIES". The recognition of the more complex structure by the index string generator is evident in the resulting index strings:
  1. Attitudes
        of Students in Universities to Courses
  2. Courses
        in Universities. Attitudes of Students
  3. Students
        in Universities. Attitudes to Courses
  4. Universities
        Courses. Attitudes of Students.
The LIPHIS approach seems to work well when most structures are linear, where it is quite economical, but seems to be confusing to indexers when more than a little additional coding is required.

Unlike the first two approaches, the third approach, that of codes covering common non-tree structures, allows the indexer to represent only certain types of structures of term links. These types of structures are assumed to be useful, and other types are ignored as insignificant. The use of the "3" "agents, factors" code in column 3 in PRECIS input strings will serve as an example. A PRECIS input string for the topic "attitudes of students in universities to courses in those universities" is

$z11030$auniversities
$zp1030$acourses
$zs1030$aattitudes$vof$wto
$z31030$astudents
and the corresponding index strings are:
  1. Attitudes. Students. Universities
        To courses
  2. Courses. Universities
        Attitudes of students
  3. Students. Universities
        Attitudes to courses
  4. Universities
        Courses. Attitudes of students
The "3" in column 3 of the last line of the input string causes the PRECIS index string generator not to follow the standard shunting procedure when generating index strings 1 and 3 above; instead, it follows the "predicate transformation" procedure. The index string generator can thus be said to recognize the link between "students" and "universities" in this case.
3.2.2.4 Simplification of structures
Indexers are often faced with items best described with structures not recognized by the index string generators with which they are working. Several choices are then open to them, each with its advantages and drawbacks. Three common choices are: 1. sacrificing one or more links; 2. duplicating parts of the description; 3. creating more than one input string. The first choice may cause a loss in predictability; the second, a loss both in succinctness and in efficiency of index production; and the third, an even greater loss in index production efficiency.

As a sample case for all three choices, take again the topic "the attitudes of students in universities to courses in those universities", this time in a system like NEPHIS. The topic seems to have four significant relationships between the things represented, as indicated in the following network diagram:

*ATTITUDES---of---*STUDENTS
|                 |
to                in
|                 |
*COURSES---in-----*UNIVERSITIES
The diagram shows two sequences of links between "ATTITUDES" and "UNIVERSITIES", and thus the structure of the topic is not a tree. An index string generator which recognizes the network structure should readily produce, for this item, consistent, predictable index strings such as those already illustrated for LIPHIS and PRECIS. The NEPHIS index string generator, however, recognizes only tree structures.
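
The two sequences of links can be found by enumerating every path that never revisits a term. This is a minimal sketch with the hypothetical function name `simple_paths`; a tree would yield exactly one path for any pair of terms.

```python
def simple_paths(links, start, end):
    """List every sequence of links from start to end that repeats no term."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    paths = []

    def extend(path):
        if path[-1] == end:
            paths.append(path)
            return
        for other in neighbors.get(path[-1], []):
            if other not in path:
                extend(path + [other])

    extend([start])
    return paths

links = [
    ("ATTITUDES", "STUDENTS"), ("ATTITUDES", "COURSES"),
    ("STUDENTS", "UNIVERSITIES"), ("COURSES", "UNIVERSITIES"),
]
for path in simple_paths(links, "ATTITUDES", "UNIVERSITIES"):
    print(" - ".join(path))
```

Dropping any one of the four links leaves a single path between every pair of terms, which is the effect of the first choice listed above.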

A NEPHIS indexer taking the first choice, that of sacrificing links, could describe the item as being on attitudes of students in universities to some, otherwise unspecified, courses:

ATTITUDES? of <STUDENTS? in <UNIVERSITIES>>? to <COURSES>
i.e.,
*ATTITUDES---of---*STUDENTS
|                 |
to                in
|                 |
*COURSES          *UNIVERSITIES
Here, the link between "COURSES" and "UNIVERSITIES" has been sacrificed. Loss of predictability comes with the index string beginning with "COURSES", which now reads
COURSES. ATTITUDES of STUDENTS in UNIVERSITIES
Searchers are still likely to find index strings beginning with phrases such as "STUDENTS in UNIVERSITIES"; thus, they may expect that, by analogy, all indexed items relating to university courses have index strings beginning with "COURSES in UNIVERSITIES". As a result, the index string above may be missed.

Predictability is especially impaired if the indexer follows no rule for sacrificing links. A typical result is two index strings such as

COURSES. ATTITUDES of STUDENTS in UNIVERSITIES
and
COURSES in UNIVERSITIES. ATTITUDES of STUDENTS
referring to two indexed items on the same topic.

An indexer taking the second choice, that of duplicating parts of the structure, can describe the document as being on attitudes of students in some universities to courses in some, not explicitly the same, universities:

ATTITUDES? of <STUDENTS? in <UNIVERSITIES>>? to <COURSES? in <UNIVERSITIES>>
i.e.,
*ATTITUDES---of---*STUDENTS
|                 |
to                in
|                 |
*COURSES          *UNIVERSITIES
|
in
|
*UNIVERSITIES
Here, the "in UNIVERSITIES" part of the description has been duplicated. The result will be a longer input string and longer index strings, such as
ATTITUDES of STUDENTS in UNIVERSITIES to COURSES in UNIVERSITIES
Unwieldy strings slow the indexer, the index string generator, and the searcher, and take up additional storage space.

An indexer taking the third choice, that of more than one input string, would describe the indexed item in two different ways, marking appropriate access terms for each; e.g.,

ATTITUDES? of <STUDENTS in UNIVERSITIES> to COURSES
and
@ATTITUDES of STUDENTS? to <COURSES? in <UNIVERSITIES>>
i.e.,
*ATTITUDES---of---*STUDENTS
|                 |
to                in
|                 |
COURSES           UNIVERSITIES
and
ATTITUDES---of---STUDENTS
|
to
|
*COURSES---in---*UNIVERSITIES
Here, the index strings produced are the same as those from an index string generator which does recognize the network structure. The failure lies in the added work required of the indexer and of the index string generator. The use of multiple input strings in fact opposes one of the major advantages of string indexing, that of many index entries from a small amount of input.

3.3 ANCILLARY INPUT

Input strings can be considered the responsibility of the indexer; other input required by string indexing software comes from thesaurus managers, from index producers, or even from searchers. This ancillary input takes various forms: lists of parts to be recognized as belonging to certain categories when they occur in input strings; information for controlling vocabulary by standardization and cross-referencing; tables determining how different types of link will be treated in generating index strings; or simple numeric values controlling such features of index strings as the number of terms.

3.3.1 Lists for categorizing parts of input strings

Lists for categorizing parts of input strings are usually very simple, the stoplist being the most common example. Usually they are merely lists of words and perhaps abbreviations; e.g.,
A
About
Ad
All
Among
An
Another
Are
As
At
and so on (Bernier 1968). Such lists may be compiled by hand or they may be generated automatically from data on the frequency of words and phrases.

Sometimes, the structure of the elements of a stoplist is somewhat complex, including, for example, information on exceptions. Thus, in the stoplist for the Cambridge Crystallographic Database index to chemical names, the entry

METHYL*ENE/ENOMYCIN/IDE/IDYN/IUM
indicates that "METHYL" should not begin an index string except as part of the words:
METHYLENE
METHYLENOMYCIN
METHYLIDE
METHYLIDYN
METHYLIUM
(Allen and Town 1977).
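
A stoplist entry of this kind can be sketched in a few lines of Python. The function names below are hypothetical, and the matching rule is an assumption: a word is treated as blocked when it begins with the stem, unless it also begins with one of the excepted words. This is not a description of the Cambridge software itself.

```python
def parse_stop_entry(entry):
    """Split an entry such as 'STEM*SUF1/SUF2' into (stem, excepted words)."""
    stem, _, suffixes = entry.partition("*")
    # Each suffix, appended to the stem, gives one excepted word.
    exceptions = {stem + suffix for suffix in suffixes.split("/") if suffix}
    return stem, exceptions


def blocked_by_entry(word, stem, exceptions):
    """Return True if this entry stops the word from beginning an index string.

    Assumption: matching is by prefix, so a longer name that starts with an
    excepted word (e.g. one beginning 'METHYLENE') is also allowed.
    """
    word = word.upper()
    if not word.startswith(stem):
        return False  # the entry says nothing about this word
    return not any(word.startswith(exc) for exc in exceptions)
```

With the entry quoted above, a name beginning "METHYLBENZENE" would be blocked, while "METHYLENE" would not.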

Stopwords follow a Bradford-type law of scattering, as do other elements in a vocabulary; thus a few stopwords will tend to occur very frequently in input strings, while others are much less frequent. A short stoplist which concentrates on the most frequent words will therefore tend to produce results almost as good as a longer stoplist. In a study of 50 entries from Chemical Titles, Feinberg (1972, p. 95; 1973, p. 58) found that a stoplist of only 16 words eliminated 29% of the words in the titles, while a stoplist of 400 words eliminated only 31%.
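
The diminishing return from longer stoplists can be measured on any collection of titles. The following Python sketch is illustrative only: the function name is hypothetical, words are split naively on white space, and it does not reproduce the method or the data of Feinberg's study.

```python
from collections import Counter


def coverage_by_stoplist_size(titles, candidate_stopwords, sizes):
    """For each size k, return the fraction of all word occurrences removed
    by a stoplist made of the k most frequent candidate stopwords."""
    words = [w.lower() for title in titles for w in title.split()]
    counts = Counter(w for w in words if w in candidate_stopwords)
    ranked = [w for w, _ in counts.most_common()]  # most frequent first
    total = len(words)
    return {
        k: sum(counts[w] for w in ranked[:k]) / total
        for k in sizes
    }
```

Calling the function with increasing sizes shows how quickly the additional words in a longer stoplist stop contributing.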

An index string generator need not be limited to one or even two ancillary input lists.  Six different lists are used in producing PERMUTERM indexes at ISI: 1. a "unique word dictionary" of "correct, verified words which have previously appeared in titles"; 2. a "full-stopword dictionary" of terms which cannot appear in index strings; 3. a "semi-stopword dictionary" of terms which cannot be access terms; 4. a "stop pair dictionary" of disallowed index strings; 5. a "variant spelling dictionary" for automatic standardization of spellings; and 6. a "word phrase dictionary" of multiword terms (Fenichel 1971).
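
How several such lists might cooperate can be sketched as follows. This is a simplified illustration, not the ISI procedure: the dictionary keys, the order of the steps, the restriction to two-word phrases, and the treatment of each index string as an ordered pair of terms are all assumptions made for the example.

```python
from itertools import permutations


def permuterm_pairs(title, lists):
    """Return (index pairs, unverified words) for one title.

    `lists` is a hypothetical dict holding the six ancillary lists.
    """
    # 5. Standardize variant spellings.
    words = [lists["variant_spelling"].get(w, w) for w in title.upper().split()]

    # 6. Join adjacent words that form a known phrase (two-word phrases only).
    terms, i = [], 0
    while i < len(words):
        phrase = " ".join(words[i:i + 2])
        if phrase in lists["word_phrase"]:
            terms.append(phrase)
            i += 2
        else:
            terms.append(words[i])
            i += 1

    # 2. Full stopwords never appear in index strings.
    terms = [t for t in terms if t not in lists["full_stop"]]

    # 1. Single words not yet verified are reported for checking.
    unverified = [t for t in terms
                  if " " not in t and t not in lists["unique_word"]]

    # 3 and 4. Semi-stopwords cannot lead a pair; stop pairs are dropped.
    pairs = [(a, b) for a, b in permutations(dict.fromkeys(terms), 2)
             if a not in lists["semi_stop"] and (a, b) not in lists["stop_pair"]]
    return pairs, unverified
```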

3.3.2 Thesauri

A thesaurus is an information retrieval tool which records relationships between the things represented by terms regardless of the specific descriptions in which the terms appear. Typical of the relationships recorded in a thesaurus is that between a subclass and a broader class; for example,
PENGUINS
    BT   BIRDS
Most thesauri also indicate which terms are preferred to others for describing indexed items; e.g., { 74}
SPHENISCIFORMES
    USE  PENGUINS

As input to string indexing software, a thesaurus most often serves in specifying what cross-references are required in the index display.

To set up, add to, and modify a string indexing system thesaurus, the indexer or thesaurus manager may use various kinds of software. Two extremes are: 1. general-purpose software such as a text editor; 2. software specially designed for the string indexing system. NEPHIS provides an example of the first alternative, which may be suitable when the thesaurus is fairly simple; PRECIS and CIFT provide examples of the second, which becomes much more helpful as the thesaurus structure becomes more complex.

Entries in a NEPHIS thesaurus (Craven 1978b) are arranged alphabetically.  Each entry consists of:

  1. a string of characters which might begin a NEPHIS index string (usually a single term);
  2. a connecting symbol, either "=", indicating synonymy, or "==", indicating any other kind of relationship;
  3. a valid NEPHIS input string.
For example, the thesaurus entry
Operations Research=@Use? of <Mathematical <Models>>? in <Solving? of <Problems>>
defines "operations research" as "the use of mathematical models in solving of problems"; it also specifies that an index which contains any index string beginning with "Operations Research" should also contain the cross-references:
  1. Mathematical Models. Use in Solving of Problems *SEE* Operations Research
  2. Models. Mathematical - . Use in Solving of Problems *SEE* Operations Research
  3. Problems. Solving. Use of Mathematical Models *SEE* Operations Research
Likewise, the thesaurus entry
Translation==Languages
indicates that translation is related to, though not synonymous with, languages; it also specifies that an index which contains "Translation" as a lead term should also contain the cross-reference { 75}
Languages *SEE ALSO* Translation
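
The three-part entry format lends itself to a simple parser. In the Python sketch below, the function names are hypothetical, and `index_strings_for` stands in for the NEPHIS index string generator, which is not reproduced here; it is assumed to return the index strings generated from a given input string. The sketch also assumes that "=" does not occur within the first part of an entry.

```python
def parse_thesaurus_entry(entry):
    """Split an entry into (lead string, connecting symbol, input string)."""
    lead, sep, rest = entry.partition("=")
    if not sep:
        raise ValueError("no connecting symbol in entry: %r" % entry)
    if rest.startswith("="):
        return lead, "==", rest[1:]  # any relationship other than synonymy
    return lead, "=", rest           # synonymy


def cross_references(entry, index, index_strings_for):
    """Return the cross-references this entry calls for in a given index."""
    lead, symbol, input_string = parse_thesaurus_entry(entry)
    # References are needed only if some index string begins with the lead.
    if not any(s.startswith(lead) for s in index):
        return []
    label = "*SEE*" if symbol == "=" else "*SEE ALSO*"
    return ["%s %s %s" % (s, label, lead) for s in index_strings_for(input_string)]
```

For an entry whose third part is a single term, such as "Translation==Languages", a generator that simply returns the term unchanged is enough to yield the cross-reference shown above.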

The PRECIS software maintains its own thesaurus, and indexers are responsible for specifying additions and changes in an editing language peculiar to this software.  The following input is required to insert new information:

  1. the input code "#RI#", meaning "record input";
  2. the RIN (reference indicator number) which shows the internal address of the term or other string of characters to which cross-references are to be made;
  3. the term or other string of characters, with typographic and layout codes where appropriate;
  4. optionally, "$d" plus a definition or scope note;
  5. optionally, a list, each element of which consists of a dollar sign ("$"), a one-letter code indicating a type of relationship, and the RIN of a term or other character string to be related to the first for the purpose of generating cross-references;
  6. a "#" indicating "end of data".
Items 3, 4, and 5 will be stored at the address specified by the RIN. For example, the input
#RI#0247847Penguins$m0245100$o0241768#
causes
Penguins$m0245100$o0241768
to be stored at address 0247847. It thereby establishes a new term "penguins" with RIN 0247847 as a preferred term related to the non-preferred synonym at address 0245100 (say, "sphenisciformes") and a narrower term (species of genus) for the term with RIN 0241768 (say, "birds").  The same command also causes complementary information to be stored at addresses 0245100 and 0241768.
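
The structure of a record-input message can be made concrete with a short Python sketch. It is not the PRECIS software: the names are hypothetical, the seven-digit RIN length is inferred from the examples, the term is assumed to contain no "$" or "#" (so typographic and layout codes are ignored), "d" is assumed to be reserved for definitions, and the complementary information stored at the related addresses is not modelled.

```python
import re

# Assumed layout: "#RI#", a 7-digit RIN, the term, an optional "$d" definition,
# zero or more "$<letter><7-digit RIN>" relationships, and a closing "#".
RI_PATTERN = re.compile(
    r"#RI#(\d{7})([^$#]*)(?:\$d([^$#]*))?((?:\$[a-z]\d{7})*)#"
)


def parse_record_input(message):
    """Break a record-input message into its parts."""
    match = RI_PATTERN.fullmatch(message)
    if match is None:
        raise ValueError("not a record-input message: %r" % message)
    rin, term, definition, relation_text = match.groups()
    return {
        "rin": rin,
        "term": term,
        "definition": definition,  # None when no "$d" part is present
        "relations": re.findall(r"\$([a-z])(\d{7})", relation_text),
        # Items 3 to 5 are what is stored at the address given by the RIN.
        "stored": message[len("#RI#") + len(rin):-1],
    }


def record_input(store, message):
    """Store the record in `store`, a dict keyed by RIN."""
    record = parse_record_input(message)
    store[record["rin"]] = record["stored"]
    return record
```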

To amend the information stored at an address, the message to be input consists of

  1. "#RA#" meaning "record amend"; { 76}
  2. the RIN of the location affected;
  3. a list, each element consisting of "$v" plus the data to be deleted in its first appearance (just "$v" if only an insertion is required) and "$w" plus the replacement data;
  4. "#".
E.g., the message
#RA#0245267$ve$wer#
would change the data at address 0245267 from
Vetebrates
to
Vertebrates
Any relationship recorded in the thesaurus must be deleted and replaced as a whole. An existing relationship cannot simply be recategorized. Thus, the message
#RA#0245267$v$n0243175$w$o0243175#
is required to change the type of relationship between address 0245267 and address 0243175.
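
The effect of an amendment message can be sketched in Python as follows. The function name is hypothetical, and two points are assumptions rather than documented PRECIS behaviour: a seven-digit RIN is inferred from the examples, and a pure insertion (an empty "$v") is appended to the end of the stored record, since the place of insertion is not specified above.

```python
import re


def apply_amendment(store, message):
    """Apply a "#RA#" message to `store`, a dict mapping RINs to records."""
    match = re.fullmatch(r"#RA#(\d{7})(.*)#", message, re.S)
    if match is None:
        raise ValueError("not a record-amend message: %r" % message)
    rin, body = match.groups()
    record = store[rin]
    # Each list element is "$v<old data>$w<new data>"; the old data may itself
    # contain "$" codes, so it is read up to the next "$w".
    for old, new in re.findall(r"\$v(.*?)\$w(.*?)(?=\$v|\Z)", body, re.S):
        if old:
            if old not in record:
                raise ValueError("data to be deleted not found: %r" % old)
            record = record.replace(old, new, 1)  # first appearance only
        else:
            record += new  # assumption: insertions go at the end
    store[rin] = record
    return record
```

Applied to a record reading "Vetebrates", the first message above replaces the first "e" with "er"; applied to a record containing "$n0243175", the second message replaces that relationship as a whole.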

To delete a record or to report the contents of a record, the input is

  1. "#RD#" for "record delete" or "#RP#" for "record report";
  2. the RIN;
  3. "#"
(Austin and Dykstra 1984, 198-275).

Because the PRECIS software does not automatically generate cross-references from the beginnings of index strings, the user must also supply another kind of ancillary input; namely, the RINs for any term or other string of characters for which cross-references are desired. These RINs are added to the input string.

3.3.3 Searcher-controllable input

An online index display can theoretically be customized to fit each different searcher and search. For this purpose, ancillary input may be made modifiable by the searcher.

The value of customized index displays can be seen by contrasting index displays produced by more traditional means. In traditional index display production, the product must be designed to satisfy the needs of a number { 77} of searchers. Inasmuch as different kinds of index entries best satisfy different searcher needs, traditional index displays are bound to be less than optimal for some searches. Individual searchers may find that they are supplied with too much information or too little or that the information is presented in the wrong order.

The experimental NETPAD system gives some indication of the possible input role of searchers in producing an online index display.  In NETPAD, the searcher is actually required to supply some of the input and in addition can modify other ancillary input before the index display is produced.

The input required from the NETPAD searcher is a string of characters with which all the index strings are to begin. The NETPAD searcher thus determines the access terms at the time of search.

Ancillary input which the NETPAD searcher can modify includes: the subheading threshold, which helps control the formatting of the index strings; the cutoff threshold, which helps control the kinds of terms that will be included in the index strings; and a table defining the different types of link. Each linktype definition assigns to the linktype: a mnemonic character; two weights ("backward" and "forward"); and two connectives ("backward" and "forward"). For example, a linktype might be defined as follows:

mnemonic: 8
forward weight: 120
forward connective: "in"
backward weight: 32
backward connective: "."
The connectives ("in" or ".") supplied by the linktype definition will be inserted by the index string generator in appropriate positions in index strings. The weights have three functions. First, they determine which types of links the index string generator will follow first in traversing a NETPAD network; hence, the weights contribute in large part to determining the order in which the terms are presented in an index string. Second, the weights function in conjunction with the cutoff threshold to determine whether certain types of links will be followed; in this way, they help to control whether or not certain terms will be included in an index string. Third, in conjunction with the subheading threshold, the weights determine where the subheading will begin.
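
A linktype table and the first two functions of the weights can be sketched in Python. The names are hypothetical, and the direction of each comparison is an assumption: the sketch follows links of higher weight first and ignores links whose weight falls below the cutoff threshold. The actual NETPAD rules may differ, and the subheading threshold is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinkType:
    mnemonic: str
    forward_weight: int
    forward_connective: str
    backward_weight: int
    backward_connective: str


def links_to_follow(links, linktypes, cutoff):
    """Order the links leaving a term and drop those below the cutoff.

    `links` is a list of (mnemonic, direction, target term) tuples, where
    direction is "forward" or "backward". Returns a list of
    (weight, connective, target term) tuples.
    """
    candidates = []
    for mnemonic, direction, target in links:
        linktype = linktypes[mnemonic]
        if direction == "forward":
            weight, connective = linktype.forward_weight, linktype.forward_connective
        else:
            weight, connective = linktype.backward_weight, linktype.backward_connective
        if weight >= cutoff:  # assumption: lighter links are not followed
            candidates.append((weight, connective, target))
    # Assumption: heavier links are followed first (the sort is stable).
    candidates.sort(key=lambda candidate: candidate[0], reverse=True)
    return candidates
```

Because the table and both thresholds are ordinary data, a searcher who changes a weight or a threshold changes the resulting index strings without any change to the input strings.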

Chapter 3 Summary

An input string contains terms representing things related in significant ways to the indexed item and may also indicate links between these terms.

The form of terms varies from one system to another; for example, in the importance attached to unambiguousness or to collocation, or in the dividing line between multiword terms and multiterm expressions. Index string { 78} generators recognize where terms begin and end in an input string and which of them are access terms by means of various combinations of automatic analysis, indexer coding, and ancillary input.

Term order, connectives, and codes may all indicate the types of links between terms in an input string. Linktype indication functions mainly to control the order of the terms in the index strings and the connectives used there.

Different index string generators are designed to recognize link structures of different degrees of complexity. Three major categories of structure are linear, tree, and network. Linear structures are quite easy to indicate. Tree structures may be coded by means of brackets, or by means of level numbers or linktype codes attached to terms; trees can also be recognized to some extent in ordinary-language input strings. Network structures may be coded by means of a separate list of all links, by means of special codes indicating additional connections in a simpler structure, or by means of codes for specifying some common non-tree structures.

To simplify a structure to fit what a given index string generator is capable of recognizing, an indexer may sacrifice one or more links, duplicate parts of the description, or create more than one input string.

In addition to the input strings themselves, other, ancillary, input is used by much string indexing software. The stoplist is the most common example of a list for categorizing elements in input strings. A thesaurus as input to string indexing software most often serves in specifying what cross-references are required in the index display. For online display purposes, ancillary input can be made modifiable by the searcher.
