
CHAPTER 4
INDEXER AIDS

This chapter will illustrate and discuss the value of various types of guidance and other assistance provided to indexers using string indexing systems. It will begin with printed tools, such as manuals and worksheets, and then move on to computerized error detection and correction, automatic coding, and graphic displays.

Indexers need guidance for two main reasons: 1. to improve the efficiency of their indexing; and 2. to improve the quality of the index so as to help searchers. Improvement of indexer efficiency includes steering indexers away from errors, which may take time to correct, and expediting their decision-making by removing causes for hesitation. Improvement of index quality includes making index elements more predictable and producing better collocation; both require consistency.

The possible penalty of guidance is inaccurate or incomplete input. Inaccurate indexing may result if the indexer attempts to force elements of the description into a mould into which they do not fit; incomplete descriptions arise if information which does not fit a set of guidelines is omitted. In either case, searchers may be served less well in the end.

4.1 MANUALS AND GUIDELINES

The more complex the input a system requires from an indexer, the greater the need for documentation explaining how to construct that input. Thus, systems like KWIC and KWOC need little in the way of manuals, whereas PRECIS, not surprisingly, has an extensive manual of instructions (Austin 1974a; Austin and Dykstra 1984), as well as an intermediate work for North American indexers (Richmond 1981) and a programmed text (Ramsden 1981).

Even where the input required is relatively simple, additional documentation may help to increase consistency. Thus, a short article describing NEPHIS (Craven 1977) is considered sufficient for beginners to produce usable input strings; but adding NEPHIS coding consistently to titles or other ready-made descriptions in ordinary language needs a brief manual (Craven and Fjerestad 1981).

In the case of NETPAD, a fairly extensive technical manual deals with the many options available to users (Declerck and Craven 1983).

Organizations which use systems without published indexer manuals may have their own in-house documentation. For example, while CIFT does not have a general published manual, a "Bibliographers' Manual" gives instructions to the Modern Language Association's own indexers.

The Relational Indexing index string generator was designed more to prove a point than with practical applications in mind; but Farradane did write a pair of articles (Farradane 1980) as a manual for Relational Indexing in general.

Many, perhaps all, string indexing systems rely to some extent on the biases of ordinary language to guide the indexer. Such systems as ASI and NEPHIS actually suggest that the whole input string should be based on a phrase in ordinary language. Even so, since ordinary language varies so greatly in structure, the kind of ordinary language may be restricted. Thus, NEPHIS tells the indexer to prefer nouns and prepositions. CASIN prefers some prepositions to others where possible (Schneider 1976, p. 59) and attempts to limit the number of prepositions in a phrase to one.

"Faceted" approaches to indexer guidance are borrowed partly from faceted classification schemes and partly from elsewhere, such as from the idea of a questionnaire. A faceted approach essentially lays out a more or less rigid skeleton structure into which the indexer must insert appropriate terms. For example, PRECIS, by means of its main role codes, encourages the indexer to think of certain questions when examining a book: "What is the main process being discussed?" ("2"); "What is affected by this process?" ("1"); "What agent is responsible for the process?" ("3"); "In what environment does this happen?" ("0"). POPSI also guides the indexer to look for descriptive elements to fulfill predefined roles or answer predefined questions: "To what discipline does this item belong?"; "With what entities does this discipline deal that are important here?"; "What parts or properties of these entities are discussed?"; "What processes affect the entities or their parts or properties?". Faceted approaches may function to control input string content, term order, or, as in PRECIS and POPSI, both.

Faceted approaches can be weakened by vagueness or inconsistency.

How [asks a critic of PRECIS (Langridge 1976)] are we supposed to understand an action category that can include shipping, road safety, mental disorders, anatomy, geography, Sunday performances, foreign relations, football, and winter? ... Why should emotional development and academic achievement be actions while research and attitudes are part or property? Why should the politics of a country be categorized differently from the law? Why are immigrants or population of a place treated as part/property while negroes (in New York) are treated as key system? If the categories are really being used precisely and consistently, how can Adam Smith in 'Adam Smith's economic theories' be agent while John Brown in 'John Brown's theories (about anything)' is key system?
None of the inconsistencies in PRECIS criticized above, it should be noted, directly affects the forms of index strings seen by searchers. All of them, however, may conceivably cause confusion and uncertainty to indexers, thus lowering indexer efficiency. They are certainly a poor advertisement for the string indexing system among prospective index producers.

In a narrow field, the questions asked in a faceted approach can be much more specific. So the MLA CIFT indexer asks: "To what genres does the literature discussed belong?"; "What themes of the literature are discussed?"; and similar questions.

Faceted approaches can be mixed with uncoded ordinary-language input. For example, although Biological Abstracts' KWIC indexes basically use titles as input strings, indexers are instructed to add, in brackets, information missing from the titles in five categories: 1. nature of the investigation; 2. nature of the report; 3. organism(s) or substance(s) studied; 4. technique, method, instrumentation employed; 5. geographic names (Parkins 1963).

4.2 AUTHORITY LISTS

String indexing may be taught by example as well as by precept. Indexers may be helped to produce good input by being shown good input that other indexers have already produced.

Any large-scale indexing operation should of course maintain its own back records as examples to new indexers; when such an archive of past decisions is to be used as an authority by indexers, it is commonly referred to as an authority list. Problems arise when new indexers do not consult the authority list properly or when the first indexing is bad, as is likely in the operation's early stages. Use of a previously produced authority list can overcome the startup problem, provided the authority list is sufficiently good and applicable. A thesaurus may serve as at least a partial authority list, or a thesaurus and an authority list may be combined into a single tool.

For PRECIS, a combined thesaurus and authority list of input strings describing books on a wide variety of subjects has been made public (British Library Automated Information Service 1979). A brief extract from this publication indicates the sort of data that it gives:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Home nursing. Sick children
1769251                           082010     649.8
083000$aSick children. Home nursing
690000$z11030$asick children$z21030$ahome nursing
692000$a029862x 692000$a0009598 693000$aILEA 008010

Home  Office                               0030791
SEE Great Britain$hHome Office 0030805

Home Office See($m) Great Britain. Home Office
0030805 Great Britain$hHome Office$m0030791

Home  protection products                   0232610
$n 001723x Protection
$n 0002968 Residences
$o 0002232 Industrial chemicals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each input string is accessible under each of its index strings. Thus, in the extract, the first heading, "Home nursing. Sick children", is an index string providing access to the input string
$z11030$asick children
$z21030$ahome nursing
Cross-references are included; for example, the extract shows one from "Home Office" to "Great Britain. Home Office". Finally, thesaurus records are displayed; e.g., the last part of the extract, which indicates that cross-references would be needed for any item on "Home protection products" from "Protection", "Residences", and "Industrial chemicals". Additional numeric codes appearing include MARC field codes ("083", "692", etc.), RIN codes for specifying sets of cross-references ("0030791", "0232610"), and Dewey Decimal Classification numbers ("649.8").

Many new documents cataloged at the British Library can be assigned PRECIS input strings that are already in the authority file because similar documents have been cataloged earlier. After only two years of operation, Austin estimated the proportion of such documents at 45% (Austin 1974, p. 393); after three years, the figure had risen to 55% (Austin 1977); by March of 1985, it had reached about 78% and was still rising, though very slowly.

A form of authority list suggested for POPSI provides an alphabetical arrangement of segments of input strings; this form of authority list is known as a "classaurus". A POPSI classaurus is divided into several main "schedules": for the fundamental categories of "discipline", "entity", "action", and "property", as well as for some other common categories such as "place" and "time". For example, from the input string

PHARMACOLOGY, CHEMICAL>DRUG>ANTIBIOTIC; STIMULATION-CIRCULATORY SYSTEM>HEART: STUDY-ANIMAL>RABBIT
"PHARMACOLOGY" would be assigned to the "discipline" schedule; "CHEMICAL>DRUG>ANTIBIOTIC" to the "entity" schedule; "STIMULATION-CIRCULATORY SYSTEM>HEART" to the "property" schedule; and "STUDY-ANIMAL>RABBIT" to the "action" schedule.

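The assignment of segments to schedules can be sketched in code. The following Python fragment is only an illustration and is not taken from any POPSI software. It assumes, on the basis of the single example above, that the first segment is the discipline and that ",", ";", and ":" introduce entity, property, and action segments respectively; the function name and the mapping are hypothetical.

```python
import re

# Assumed mapping from the punctuation mark that introduces a segment to
# its schedule, inferred only from the one example in the text.
SCHEDULE_BY_MARK = {",": "entity", ";": "property", ":": "action"}


def split_into_schedules(input_string):
    """Return a list of (schedule, segment) pairs for a POPSI-like string."""
    # A capturing group makes re.split keep the punctuation marks.
    parts = re.split(r"\s*([,;:])\s*", input_string.strip())
    pairs = [("discipline", parts[0])]
    # After the first segment the parts alternate: mark, segment, mark, ...
    for mark, segment in zip(parts[1::2], parts[2::2]):
        pairs.append((SCHEDULE_BY_MARK[mark], segment))
    return pairs


example = ("PHARMACOLOGY, CHEMICAL>DRUG>ANTIBIOTIC; "
           "STIMULATION-CIRCULATORY SYSTEM>HEART: STUDY-ANIMAL>RABBIT")
for schedule, segment in split_into_schedules(example):
    print(f"{schedule:10} {segment}")
```

A real classaurus builder would also have to break each segment at ">" and "-" to file narrower and qualifying terms; that step is left out here.
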
The elements of each schedule of a classaurus are laid out so that under each term are: narrower terms; equivalent terms, preceded by "="; and qualifying terms, preceded by "-".  For example, part of a classaurus for leather technology (Devadason and Ramanujam 1982) is:

ACTION SCHEDULE

BEAM HOUSE OPERATION
 CURING
  BRINE CURING
  =BRINING
  DRYING
  SALT CURING
  =SALT PACK CURING
   DRY SALT CURING
   =DRY SALTING
   RE SALTING
   WET SALT CURING
   =WET SALTING
 PUERING
  (AGENT USED)
   - CHICKEN MANURE
   - DOG MANURE

4.3 WORKSHEETS

When computer software requires input with a good deal of coding, worksheets can be very useful for the people who compose this input. Worksheets can be especially useful when the coding tends to repeat itself in the same pattern, such as the "$z..030$a" pattern recurring so often in PRECIS input strings.

The British Library indexing worksheet has a large section for PRECIS indexing. Space is included for the SIN (subject identifier number) and RINs (reference indicator numbers) and for the PRECIS input string itself. Columns 1, 2, 7, 8, and 9 in the PRECIS input string are preprinted to save the indexer's time (Richmond 1981, p. 234):

690
$z           0$a
$z           0$a
.....
If a preprinted code is not appropriate, the indexer can cross it out and write in the correct code. For example, for the name of a country, the preprinted "a" code in column 9 is inappropriate, and a "d" code should be used instead; if a country name follows, the indexer therefore crosses out the "a" and inserts "d" immediately after:
0$ad United States
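
The fixed column layout that the worksheet preprints also makes an input string easy to take apart mechanically. The Python sketch below is an illustration only, not part of the PRECIS software. It relies on the column positions given in this chapter ("$z" in columns 1 and 2, the role code in column 3, "$" and a one-letter code in columns 8 and 9) and treats columns 4 to 7 as opaque; the function name is hypothetical.

```python
import re

# One element of the input string:
#   columns 1-2  "$z"
#   column  3    role code (captured)
#   columns 4-7  other codes, not interpreted in this sketch
#   columns 8-9  "$" plus a one-letter code such as "a" or "d" (captured)
#   then the term itself, up to the next "$"
ELEMENT = re.compile(r"\$z(.).{4}\$([a-z])([^$]*)")


def split_elements(input_string):
    """Return (role_code, term_code, term) for each element found."""
    return [
        (m.group(1), m.group(2), m.group(3).strip())
        for m in ELEMENT.finditer(input_string)
    ]


# Input string taken from the authority-list extract in section 4.2.
for element in split_elements("$z11030$asick children$z21030$ahome nursing"):
    print(element)
```

Terms that themselves contain "$" subfields, such as "Great Britain$hHome Office", would be cut short by this pattern and would need fuller handling.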

Worksheets have also been developed for input to other string indexing software. The Relational Indexing worksheet is divided into two parts, one for the list of terms and the other for the list of links. The CASIN worksheet provides an appropriate space for each of the system's 41 "categories", with the code preprinted for each category.

An obvious way to guide indexers toward certain formulas for describing items is to have them respond to questionnaires. A CIFT indexing worksheet in fact comes close to a questionnaire. The Modern Language Association of America currently uses four worksheets for CIFT indexing: one for National Literatures, one for Language and Linguistics, one for General Literature, and one for Folklore. On each worksheet, all appropriate "facet" codes and some "role" codes are preprinted.

Figure 1 shows the MLA National Literatures worksheet.

Figure 1. MLA Bibliography Worksheet (recto and verso).

"Facet" codes include:

yl/
ya/
ul/
ma/
"Role" codes are:
<tof     "treatment of"
<ion     "influence on"
<soi     "sources in"
<apo     "application of"
The "i" following "ul/" indicates a term which appears only in the index and not in the classified sequence. The indexer can choose English, French, German, Spanish, Italian, or Russian as the document language simply by circling the appropriate abbreviation. The list of permitted "role" codes appears on all the worksheets, rather than on a separate sheet as for PRECIS.

An earlier version of a CIFT indexing worksheet also includes footnotes indicating which questions must be answered and under which circumstances; e.g., "1. Specific persons in RA facet require period (TA) and place (UA)." "Questions" which always require answers are marked with an "(a)".

Generally the worksheet or questionnaire approach presumes that the indexer is recording data on paper and that someone else is later actually inputting the data at a keyboard; electronic worksheets or questionnaires are also possible, however. PERMDEX supplies a fairly simple example: the software first prompts the indexer for a term and then asks what role code is to be attached and whether the term is an access term or not.  Electronic worksheets are already widely used for library catalog systems. Some experimentation on electronic worksheets has also been carried out for NETPAD (Craven 1983b).

4.4 INDEX STRING DISPLAY

A fairly simple way for string indexing software to assist the indexer is to display the index strings that result from each input string. The indexer can then examine the index strings to see whether they seem to provide useful access to the indexed item; that is, whether the access terms appear appropriate, whether the correct meaning is conveyed, and so on.  For example, the MLA CIFT validation software prints for each input string one complete index string plus a list of the headings of all the index strings (Modern Language Association 1982, p. 13). At UTLAS, software, run monthly, displays not only a complete set of index strings for each new PRECIS input string, but also all the "see" and "see also" cross-references required (Cain 1984).

Of course the index-string display approach will not work for input strings that are so badly constructed that they cannot be processed. Also, examining the index strings may consume considerable time if they are long and varied and there are many of them. Furthermore, if something is wrong, the indexer still needs to determine what to change in the input string.

4.5 ERROR DETECTION

String indexer errors occur in a number of varieties. At one extreme are subtle errors in completeness, accuracy, or consistency in analyzing the significance of indexed items; such errors are often more in the minds of other indexers than objectively definable. At the other extreme are fatal coding errors; that is, errors which make the input string unusable by the index string generator even for producing bad index strings. Between the two extremes, indexers misspell terms, fail to use preferred terms, or code structures which they do not intend. Error detection software is clearly easier to design the closer the errors to be detected are to the second extreme.

Sometimes, an indexer will create an input string which seems likely to contain an error, but which might also be correct. For example, a term beginning with a character other than a letter of the alphabet might usually be expected to be a mistake. But it may not be a mistake: the nonalphabetic character may be intended to place an index entry in a classified sequence. Thus, when string indexing software encounters a likely input error it may warn the indexer, but should not reject the input.
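
The distinction between warning and rejecting can be shown in a few lines of Python. This is a minimal sketch under the assumption that terms arrive as plain strings; the function name and the wording of the message are hypothetical and are not drawn from any of the systems discussed.

```python
def warnings_for_term(term):
    """Return warning messages for a term; an empty list means no suspicion.

    The term is never rejected here: a leading nonalphabetic character may
    be a deliberate device for placing an entry in a classified sequence.
    """
    messages = []
    first = term.lstrip()[:1]
    if first and not first.isalpha():
        messages.append(
            f"term {term!r} begins with {first!r}, which is not a letter; "
            "please confirm that this is intended"
        )
    return messages


for term in ["COAL MINES", "3 COAL MINES"]:  # hypothetical terms
    notes = warnings_for_term(term)
    print(term, "->", notes if notes else "no warning")
```

Because the function only reports, the calling program remains free to store the input string whether or not a warning was issued.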

If an error is detected or suspected, the indexer needs to get some sort of message so as to be able to correct or check the input. The simplest, and least helpful, type of message is one saying simply that an input string has been rejected; the Relational Indexing software gives this type of message. More helpful are messages giving the type of error or pointing to the error's location. PERMDEX and UTLAS PRECIS supply examples of messages giving the type of error; NEPHIS and CASIN supply examples of both kinds.

The UTLAS version of the PRECIS software checks for 36 different error conditions in an input string. Thirty of these error conditions lead to "customer messages", which are sent to the person or organization responsible for the input string. The remainder lead to "diagnostic messages", which are kept for internal use. Customer messages cover such errors as invalid codes, incorrect combinations of codes, and the absence of required elements. Examples of the last are an input string containing no terms at all or one not starting with a "0", "1", or "2" role code in column 3. Diagnostic messages arise when an input string requires more space in the computer's memory than was allowed by the programmer (Cain 1984).

The PERMDEX software not only checks for coding errors but also prompts the indexer for corrections. Specifically, if the input string does not contain at least one term with one of the three most important role codes, the indexer is prompted to supply the missing term. These three most important role codes are "0", "1", and "2". The PERMDEX test, however, is not equivalent to the one in UTLAS PRECIS mentioned above: PERMDEX does not require role code "0", "1", or "2" to be the first in the input string; moreover, PERMDEX's "1" is, in fact, closer in meaning to PRECIS' "3" than to PRECIS' "1".
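
The difference in shape between the two tests can be made concrete. The Python sketch below assumes that an input string has already been reduced to a list of its role codes in order; both function names are hypothetical, and neither function reproduces the actual PERMDEX or UTLAS code. It compares only the form of the tests, since, as noted above, the role codes do not mean the same thing in the two systems.

```python
MAIN_ROLE_CODES = {"0", "1", "2"}


def first_role_is_main(role_codes):
    """UTLAS-style test: the first role code must be "0", "1", or "2"."""
    return bool(role_codes) and role_codes[0] in MAIN_ROLE_CODES


def any_role_is_main(role_codes):
    """PERMDEX-style test: at least one term carries "0", "1", or "2"."""
    return any(code in MAIN_ROLE_CODES for code in role_codes)


codes = ["3", "2"]  # hypothetical sequence of role codes
print(first_role_is_main(codes))  # fails the stricter, positional test
print(any_role_is_main(codes))    # passes the looser, presence-only test
```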

NEPHIS software generally detects two major types of coding error: unmatched brackets and improperly terminated connectives. Where the error is detected is indicated by showing where generation of the first index string had to be abandoned. For example, given the incorrect input string

RESCUES? OF CHILDREN>? BY <DOGS>
the Commodore BASIC version of the NEPHIS index string generator responds
***OUCH!***

THERE IS A '>' WITHOUT A MATCHING '<'

THE FIRST PERMUTATION GOT AS FAR AS

RESCUES OF CHILDREN
Because the NEPHIS index string generation rules are fairly simple, an experienced NEPHIS indexer should often be able to find the error in the input fairly quickly from this kind of response.
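
A check of this kind for unmatched brackets is easy to sketch. The Python below is an illustration, not the NEPHIS program: it does not generate permutations, so it only approximates "how far the first permutation got" by reporting the text that precedes the offending bracket, with coding symbols removed. It does not check connectives. The function names are hypothetical, and the wording of the second message is an assumption modelled on the first.

```python
CODING_SYMBOLS = "<>?@"


def strip_coding(text):
    """Remove NEPHIS coding symbols and surrounding spaces from text."""
    return "".join(c for c in text if c not in CODING_SYMBOLS).strip()


def find_unmatched_bracket(input_string):
    """Return (message, text_so_far) for the first bracket error, else None."""
    open_positions = []  # positions of "<" still waiting for a ">"
    for position, char in enumerate(input_string):
        if char == "<":
            open_positions.append(position)
        elif char == ">":
            if not open_positions:
                return ("THERE IS A '>' WITHOUT A MATCHING '<'",
                        strip_coding(input_string[:position]))
            open_positions.pop()
    if open_positions:
        # Assumed wording, by analogy with the message quoted in the text.
        return ("THERE IS A '<' WITHOUT A MATCHING '>'",
                strip_coding(input_string[:open_positions[0]]))
    return None


print(find_unmatched_bracket("RESCUES? OF CHILDREN>? BY <DOGS>"))
```

For the incorrect input string above, the sketch reports the same message and the same text, "RESCUES OF CHILDREN", as the response quoted from the Commodore BASIC version.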

Some CASIN error messages, on the other hand, are very exact about the location of errors in an input string, though the result can be more error messages than there are underlying errors. For example, in the message

   10*3-10- d0272#
   21*demand for //bananas#
   32*Australia
"STRUCTURAL DEFECT: POSITION  1 NO DIGIT"
"STRUCTURAL DEFECT: POSITION  2 NO DIGIT"
"STRUCTURAL DEFECT: POSITION  3 NO SPECIAL CHAR."
   51*Economics#
   71*21 Bananas#
   81*21#
the positions of three invalid characters are noted, while the underlying error is a single one of omitting the second "category" code in a line which should read
51*21 Economics#

While coding errors are the errors for which string indexing software most often checks, errors in terms may also be caught. Thus, the CASIN software will reject any input string specifying an access term which is not in its master list of permitted headings (Schneider 1976, pp. 162-163).

4.6 ERROR CORRECTION

It is better for the indexer if the software corrects mistakes automatically than if the indexer makes mistakes and then has to correct them. Correcting errors automatically, however, is not usually as simple as detecting them automatically. Thus, it is not surprising that relatively little work has been done here in relation to string indexing.

Some errors in input terms can be corrected automatically by using a thesaurus or authority file. The software can look up the input terms to see whether they have equivalent preferred terms and if so substitute the preferred terms automatically. Even if the exact terms cannot be looked up, the software might try to match an input term to a similar term in the thesaurus according to some set of rules. CIFT takes this approach (Modern Language Association 1982, pp. 12-13). CIFT's general rule for matching terms is to transform both to lower case, strip them of accents, and treat hyphens as spaces; some additional rules are used on titles and author names. More sophisticated matching is theoretically possible. For example, automatic spelling correction procedures could be applied to string indexing input.
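
The general matching rule just described (lower case, no accents, hyphens as spaces) can be sketched in Python. This is not the CIFT code, and the additional rules for titles and author names are not attempted. The function names and sample terms are hypothetical, and collapsing runs of spaces is an added assumption.

```python
import unicodedata


def normalize(term):
    """Lower-case a term, strip accents, and treat hyphens as spaces."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    # After NFD decomposition, accents are separate combining characters.
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Assumption: runs of spaces are collapsed so that spacing cannot
    # prevent a match.
    return " ".join(unaccented.replace("-", " ").split())


def match_preferred(term, thesaurus_terms):
    """Return the thesaurus term matching the input term, or None."""
    by_key = {normalize(t): t for t in thesaurus_terms}
    return by_key.get(normalize(term))


thesaurus = ["Molière", "avant-garde theater"]  # hypothetical entries
print(match_preferred("MOLIERE", thesaurus))
print(match_preferred("Avant garde theater", thesaurus))
print(match_preferred("Moliere's plays", thesaurus))
```

Matching on a normalized key keeps the thesaurus form as the one that is substituted, so the indexer's variant never reaches the index.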

Some errors in coding can also be corrected automatically. Indeed, interactive error-correcting software may even appear to prevent the indexer from creating incorrect input strings in the first place. NEPHIS coding rules are simple enough that it has been possible to devise an online screen editing program which does this for a number of types of coding error (Craven 1983a).

In general, the NEPHIS screen editing program takes two approaches against errors.  First, under certain conditions, if the indexer presses a key for a character whose addition to the input string would result in an error, then the program simply ignores it; when this happens, the key appears "dead" to the indexer. Second, a character which by itself would create an error may be inserted in the input string in reverse field (black on white instead of white on black); it then remains in reverse field until one or more other characters which make it correct are inserted. Any reverse-field characters remaining in an input string are recognized by the editing program as having no value as NEPHIS coding symbols; they will be discarded whenever the input string is stored for later processing by the NEPHIS program.

For example, suppose the indexer is editing the phrase

COAL MINES IN CANADA
and tries to insert a left bracket before "MINES" (to provide access under this term). The result will be
COAL <MINES IN CANADA
That is, the left bracket will be in reverse field because a matching right bracket is still needed to avoid an error. As soon as the indexer inserts the missing right bracket, the reverse-field left bracket is changed to an ordinary left bracket; for example,
COAL <MINES> IN CANADA
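
The second approach, holding a bracket in a provisional state until its partner arrives, can be imitated with a small amount of bookkeeping. The Python sketch below is not the NEPHIS screen editor: it ignores the "dead key" behaviour, treats every unmatched bracket as provisional, and represents reverse field simply as a set of character positions. The function names are hypothetical.

```python
def pending_brackets(text):
    """Return positions of brackets that do not yet have a partner.

    These correspond to the characters the editor would show in
    reverse field.
    """
    pending = set()
    open_positions = []
    for position, char in enumerate(text):
        if char == "<":
            open_positions.append(position)
        elif char == ">":
            if open_positions:
                open_positions.pop()   # this ">" completes an earlier "<"
            else:
                pending.add(position)  # a ">" with nothing to close
    pending.update(open_positions)     # any "<" never closed
    return pending


def store(text):
    """Discard provisional brackets, as happens when a string is stored."""
    skip = pending_brackets(text)
    return "".join(c for i, c in enumerate(text) if i not in skip)


print(pending_brackets("COAL <MINES IN CANADA"))   # the "<" is provisional
print(pending_brackets("COAL <MINES> IN CANADA"))  # nothing provisional
print(store("COAL <MINES IN CANADA"))
```

Recomputing the provisional set after every keystroke is enough to make the left bracket revert to an ordinary character as soon as its right bracket is typed.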

4.7 AUTOMATIC CODING

In some string indexing systems with coded input strings, software can be written to aid the indexer by suggesting what coding to add to a phrase in ordinary language describing an item. It is fairly obvious that the indexer using such software has an advantage over indexers using the same string indexing systems without automatic coding suggestion. But there is also an advantage over systems with ordinary-language input strings; namely, a chance to make changes in the suggested coding of the input strings before the index strings are produced.

An example of automatic coding suggestion is the automatic flagging-of-headings option in the OLPI program (Baser and others 1978), an online version of ASI. Using OLPI, the indexer, after first typing in a descriptive phrase, normally indicates by number terms in the phrase which are to be access point terms. In the automatic flagging option, however, the program suggests the access points and the indexer can accept or reject the suggestions.

More powerful aid is provided by the NEPHIS automatic coder (Craven 1982a). Using two short lists, a stoplist and a list of connectives, this program crudely analyzes title-like phrases and adds the NEPHIS coding symbols "<", ">", and "@". For example, given the title

PROMOTION OF INFORMATION SERVICES: AN EVALUATION OF ALTERNATIVE APPROACHES
the NEPHIS automatic coder will produce the input string
PROMOTION OF <INFORMATION <SERVICES>>: <@AN <EVALUATION> OF <ALTERNATIVE <APPROACHES>>>
NEPHIS benefits here from its heavy reliance on the structure of ordinary language. Automatic coding for PRECIS has been suggested, but does not seem feasible because PRECIS is much more complex and has many more coding requirements. Even the NEPHIS automatic coder is somewhat limited: it does not add the "?" symbol and it produces unsatisfactory results for certain phrases, especially for some containing coordinating conjunctions, titles or corporate names, or non-English words.

4.8 GRAPHIC DISPLAYS

When an indexer needs to describe an item by means of a fairly complex structure of terms and links between terms, a two-dimensional picture of the structure may be very useful. The network diagrams in this book are only one example.

Relational Indexing started as a manual system, in which the indexer drew a structure for each item by hand or by using a typewriter; for example, for an article on "the relationship of informal communication among sleep researchers to their research productivity",
sleep /; researchers /; communication [informal]
            /:             /;
          research   /+  productivity
An indexer can use a hand-made graphic display like this as a source in making up the complete input string for computer processing:
v=1;s=sleep
v=1;s=researchers
v=1;s=communication [informal]
v=1;s=productivity
v=1;s=research
l=1;w=1;r=7;p=on;w=2
w=2;r=7;l=1;w=3
w=3;p=related to;r=7;w=4
w=5;r=5;p=of;w=4
g=1;w=2;r=9;p=by;l=1;w=5
The Relational Indexing indexer must carry out the translation from diagram to input string by hand and must in the process add information not given in the diagram.

Working in the opposite direction, the NEPHTREE program (Craven 1980a) takes a NEPHIS input string, or an indexer's attempt at a NEPHIS input string, and automatically displays the string's tree structure.
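
The idea of turning bracket nesting into a visible structure can be sketched briefly. The Python below is not NEPHTREE, whose actual display layout is not described here; it merely shows each piece of text indented according to how many brackets enclose it. The function name is hypothetical. Other coding symbols are left in place, and a stray ">" is tolerated so that an imperfect attempt at an input string can still be displayed.

```python
def nesting_outline(input_string):
    """Return (depth, text) pairs showing how "<" and ">" nest."""
    outline = []
    buffer = []
    depth = 0

    def flush():
        # Record the text collected so far at the current depth.
        text = "".join(buffer).strip()
        if text:
            outline.append((depth, text))
        buffer.clear()

    for char in input_string:
        if char == "<":
            flush()
            depth += 1
        elif char == ">":
            flush()
            depth = max(depth - 1, 0)  # tolerate an unmatched ">"
        else:
            buffer.append(char)
    flush()
    return outline


for depth, text in nesting_outline("COAL <MINES> IN CANADA"):
    print("    " * depth + text)
```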

A still better aid is continuous graphic display while the indexer is creating input. The Commodore BASIC version of NETPAD provides an example: a special online screen editor which automatically, as the indexer types in the correct commands, both draws the two-dimensional display and stores the data needed to build the string. For instance, once an indexer has used the NETPAD editor to create the display

COMMUNICATION (INFORMAL)
|/^
|  PRODUCTIVITY
|   /(
|     RESEARCH
 /[    /[
   ------RESEARCHERS
          /^
            SLEEP
simply pressing one key will store the corresponding NETPAD input string.

4.9 AIDS TO OTHERS INVOLVED WITH THE INDEX

Although this chapter is basically about aids to the indexer, other people involved in producing an index display will also benefit from certain kinds of assistance. One such person can be called the index producer: while the indexer is responsible for descriptions of individual indexed items, the index producer is responsible for producing the index as a whole.

Double-KWIC provides an illustration of how string indexing software may supply aid to an index producer for generating ancillary input. The Double-KWIC software could recognize many possible headings for index strings in a single input string, far more than would be useful for a single index display. For example, the possible headings from the title

DASAR: COMPUTER-BASED DATA STORAGE AND DATA RETRIEVAL
would be
COMPUTER
COMPUTER BASED
COMPUTER BASED DATA
DASAR
DATA
DATA STORAGE
RETRIEVAL
STORAGE
STORAGE AND DATA

To restrict the possibilities before the index is produced, the software presents the index producer with an alphabetized list of all such possible headings and their frequencies. The index producer then compiles the golist of allowed headings by selecting headings from the list by sequence number.
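
The selection step can be sketched in Python. This is an illustration, not the Double-KWIC software: it takes the possible headings for each title as given, without attempting to generate them, and it assumes that a heading's frequency is the number of titles in which it occurs. The function names and the second title's headings are hypothetical.

```python
from collections import Counter


def numbered_candidates(headings_per_title):
    """Return (sequence_number, heading, frequency) in alphabetical order."""
    # Assumption: frequency counts titles, so each heading is counted at
    # most once per title.
    counts = Counter(h for headings in headings_per_title for h in set(headings))
    return [(n, h, counts[h]) for n, h in enumerate(sorted(counts), start=1)]


def build_golist(candidates, chosen_numbers):
    """Return the headings whose sequence numbers were selected."""
    chosen = set(chosen_numbers)
    return [heading for n, heading, _ in candidates if n in chosen]


titles = [
    # Possible headings for the DASAR title, as listed in the text.
    ["COMPUTER", "COMPUTER BASED", "COMPUTER BASED DATA", "DASAR", "DATA",
     "DATA STORAGE", "RETRIEVAL", "STORAGE", "STORAGE AND DATA"],
    # Hypothetical headings from a second title.
    ["DATA", "RETRIEVAL"],
]
candidates = numbered_candidates(titles)
for number, heading, frequency in candidates:
    print(f"{number:3} {heading:22} {frequency}")
print(build_golist(candidates, [4, 5, 7]))
```

Selecting by sequence number spares the index producer from retyping headings and so avoids introducing spelling variants into the golist.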

Chapter 4 Summary

Indexers need guidance to improve both their own efficiency and the quality of the index. A possible penalty of guidance, however, is inaccurate or incomplete input.

Documentation in the form of manuals provides one kind of guidance. Many string indexing systems also rely to some extent on the biases of ordinary language. Faceted approaches lay out more rigid structures into which the indexer inserts appropriate terms.

Past decisions may be collected into an authority list, which often is, or is combined with, a thesaurus; examples are the PRECIS authority list and POPSI's classaurus.

Forms or worksheets are another useful aid, especially when much coding is required, as in PRECIS, or when a faceted approach is taken, as in CIFT. Electronic forms or worksheets are becoming more common.

Software may assist indexers in a number of ways. The index strings resulting from a given input string may be displayed. Errors, especially in coding, may be detected and their type and position indicated. Some errors in terms or coding may be corrected automatically. Suitable coding may be suggested. Either the software or the indexer may produce graphical displays of input string structures.

Apart from the indexer, other people involved in producing an index display may also benefit from assistance, as the index producer does in Double-KWIC.
