Indexers need guidance for two main reasons: 1. to improve the efficiency of their indexing; and 2. to improve the quality of the index so as to help searchers. Improvement of indexer efficiency includes steering indexers away from errors, which may take time to correct, and expediting their decision-making by removing causes for hesitation. Improvement of index quality includes making index elements more predictable and producing better collocation; both require consistency.
The possible penalty of guidance is inaccurate or incomplete input. Inaccurate indexing may result if the indexer attempts to force elements of the description into a mould into which they do not fit; incomplete descriptions arise if information which does not fit a set of guidelines is omitted. In either case, searchers may be served less well in the end.
Even where the input required is relatively simple, additional documentation may help to increase consistency. Thus, a short article describing NEPHIS (Craven 1977) is considered sufficient for beginners to produce usable input strings; but adding NEPHIS coding consistently to titles or other ready-made descriptions in ordinary language needs a brief manual (Craven and Fjerestad 1981).
In the case of NETPAD, a fairly extensive technical manual deals with the many options available to users (Declerck and Craven 1983).
Organizations which use systems without published indexer manuals may have their own in-house documentation. For example, while CIFT does not have a general published manual, a "Bibliographers' Manual" gives instructions to the Modern Language Association's own indexers.
The Relational Indexing index string generator was designed more to prove a point than with practical applications in mind; but Farradane did write a pair of articles (Farradane 1980) as a manual for Relational Indexing in general.
Many, perhaps all, string indexing systems rely to some extent on the biases of ordinary language to guide the indexer. Such systems as ASI and NEPHIS actually suggest that the whole input string should be based on a phrase in ordinary language. Even so, since ordinary language varies so greatly in structure, the kind of ordinary language may be restricted. Thus, NEPHIS tells the indexer to prefer nouns and prepositions. CASIN prefers some prepositions to others where possible (Schneider 1976, p. 59) and attempts to limit the number of prepositions in a phrase to one.
"Faceted" approaches to indexer guidance are borrowed partly from faceted classification schemes and partly from elsewhere, such as from the idea of a questionnaire. A faceted approach essentially lays out a more or less rigid skeleton structure into which the indexer must insert appropriate terms. For example, PRECIS, by means of its main role codes, encourages the indexer to think of certain questions when examining a book: "What is the main process being discussed?" ("2"); "What is affected by this process?" ("1"); "What agent is responsible for the process?" ("3"); "In what environment does this happen?" ("0"). POPSI also guides the indexer to look for descriptive elements to fulfill predefined roles or answer predefined questions: "To what discipline does this item belong?"; "With what entities does this discipline deal that are important here?"; "What parts or properties of these entities are discussed?"; "What processes affect the entities or their parts or properties?". Faceted approaches may function to control input string content, term order, or, as in PRECIS and POPSI, both.
Faceted approaches can be weakened by vagueness or inconsistency.
How [asks a critic of PRECIS (Langridge 1976)] are we supposed to understand an action category that can include shipping, road safety, mental disorders, anatomy, geography, Sunday performances, foreign relations, football, and winter? ... Why should emotional development and academic achievement be actions while research and attitudes are part or property? Why should the politics of a country be categorized differently from the law? Why are immigrants or population of a place treated as part/property while negroes (in New York) are treated as key system? If the categories are really being used precisely and consistently, how can Adam Smith in 'Adam Smith's economic theories' be agent while John Brown in 'John Brown's theories (about anything)' is key system?
None of the inconsistencies in PRECIS criticized above, it should be noted, directly affects the forms of index strings seen by searchers. All of them, however, may conceivably cause confusion and uncertainty to indexers, thus lowering indexer efficiency. They are certainly a poor advertisement for the string indexing system among prospective index producers.
In a narrow field, the questions asked in a faceted approach can be much more specific. So the MLA CIFT indexer asks: "To what genres does the literature discussed belong?"; "What themes of the literature are discussed?"; and similar questions.
Faceted approaches can be mixed with uncoded ordinary-language input. For example, although Biological Abstracts' KWIC indexes basically use titles as input strings, indexers are instructed to add, in brackets, information missing from the titles in five categories: 1. nature of the investigation; 2. nature of the report; 3. organism(s) or substance(s) studied; 4. technique, method, instrumentation employed; 5. geographic names (Parkins 1963).
Any large-scale indexing operation should of course maintain its own back records as examples to new indexers; when such an archive of past decisions is to be used as an authority by indexers, it is commonly referred to as an authority list. Problems arise when new indexers do not consult the authority list properly or when the first indexing is bad, as is likely in the operation's early stages. Use of a previously produced authority list can overcome the startup problem, provided the authority list is sufficiently good and applicable. A thesaurus may serve as at least a partial authority list, or a thesaurus and an authority list may be combined into a single tool.
For PRECIS, a combined thesaurus and authority list of input strings describing books on a wide variety of subjects has been made public (British Library Automated Information Service 1979). A brief extract from this publication indicates the sort of data that it gives:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Home nursing. Sick children
1769251 082010 649.8
083000$aSick children. Home nursing
690000$z11030$asick children$z21030$ahome nursing
692000$a029862x 692000$a0009598 693000$aILEA 008010
Home Office 0030791
SEE Great Britain$hHome Office 0030805
Home Office See($m) Great Britain. Home Office
0030805 Great Britain$hHome Office$m0030791
Home protection products 0232610
$n 001723x Protection
$n 0002968 Residences
$o 0002232 Industrial chemicals
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Each input string is accessible under each of its index strings. Thus, in the extract, the first heading, "Home nursing. Sick children", is an index string providing access to the input string
$z11030$asick children
$z21030$ahome nursing
Cross-references are included; for example, the extract shows one from "Home Office" to "Great Britain. Home Office". Finally, thesaurus records are displayed; e.g., the last part of the extract, which indicates that cross-references would be needed for any item on "Home protection products" from "Protection", "Residences", and "Industrial chemicals". Additional numeric codes appearing include MARC field codes ("083", "692", etc.), RIN codes for specifying sets of cross-references ("0030791", "0232610"), and Dewey Decimal Classification numbers ("649.8").
Many new documents cataloged at the British Library can be assigned PRECIS input strings that are already in the authority file because similar documents have been cataloged earlier. After only two years of operation, Austin estimated the proportion of such documents at 45% (Austin 1974, p. 393); after three years, the figure had risen to 55% (Austin 1977); by March of 1985, it had reached about 78% and was still rising, though very slowly.
A form of authority list suggested for POPSI provides an alphabetical arrangement of segments of input strings; this form of authority list is known as a "classaurus". A POPSI classaurus is divided into several main "schedules": for the fundamental categories of "discipline", "entity", "action", and "property", as well as for some other common categories such as "place" and "time". For example, from the input string
PHARMACOLOGY, CHEMICAL>DRUG>ANTIBIOTIC; STIMULATION-CIRCULATORY SYSTEM>HEART: STUDY-ANIMAL>RABBIT
"PHARMACOLOGY" would be assigned to the "discipline" schedule; "CHEMICAL>DRUG>ANTIBIOTIC" to the "entity" schedule; "STIMULATION-CIRCULATORY SYSTEM>HEART" to the "property" schedule; and "STUDY-ANIMAL>RABBIT" to the "action" schedule.
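The assignment of segments to schedules in this example can be sketched in a few lines of Python. This is only an illustration, not part of any POPSI software: the function name is hypothetical, and the mapping of ",", ";", and ":" to "entity", "property", and "action" is an assumption inferred from the single example above.

```python
import re

# Assumed mapping, inferred only from the example in the text: the string
# opens with the discipline, and ",", ";" and ":" introduce the entity,
# property and action segments respectively.
INDICATORS = {",": "entity", ";": "property", ":": "action"}


def split_into_schedules(input_string):
    """Return (schedule, segment) pairs for a POPSI-like input string."""
    # re.split with a capturing group alternates text and delimiters.
    parts = re.split(r"([,;:])", input_string)
    pairs = [("discipline", parts[0].strip())]
    for mark, text in zip(parts[1::2], parts[2::2]):
        pairs.append((INDICATORS[mark], text.strip()))
    return pairs


example = ("PHARMACOLOGY, CHEMICAL>DRUG>ANTIBIOTIC; "
           "STIMULATION-CIRCULATORY SYSTEM>HEART: STUDY-ANIMAL>RABBIT")
for schedule, segment in split_into_schedules(example):
    print(f"{schedule:10} {segment}")
```

Each resulting segment could then be filed in the corresponding schedule of the classaurus; the ">" and "-" symbols inside a segment are left untouched by this sketch.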
The elements of each schedule of a classaurus are laid out so that under each term are: narrower terms; equivalent terms, preceded by "="; and qualifying terms, preceded by "-". For example, part of a classaurus for leather technology (Devadason and Ramanujam 1982) is:
ACTION SCHEDULE
BEAM HOUSE OPERATION
  CURING
    BRINE CURING
      =BRINING
    DRYING
    SALT CURING
      =SALT PACK CURING
      DRY SALT CURING
        =DRY SALTING
      RE SALTING
      WET SALT CURING
        =WET SALTING
  PUERING
    (AGENT USED)
      - CHICKEN MANURE
      - DOG MANURE
The British Library indexing worksheet has a large section for PRECIS indexing. Space is included for the SIN (subject identifier number) and RINs (reference indicator numbers) and for the PRECIS input string itself. Columns 1, 2, 7, 8, and 9 in the PRECIS input string are preprinted to save the indexer's time (Richmond 1981, p. 234):
690
$z 0$a
$z 0$a
.....
If a preprinted code is not appropriate, the indexer can cross it out and write in the correct code. For example, for the name of a country, the preprinted "a" code in column 9 is inappropriate, and a "d" code should be used instead; if a country name follows, the indexer therefore crosses out the "a" and inserts "d" immediately after:
0$ad United States
Worksheets have also been developed for input to other string indexing software. The Relational Indexing worksheet is divided into two parts, one for the list of terms and the other for the list of links. The CASIN worksheet provides an appropriate space for each of the system's 41 "categories", with the code preprinted for each category.
An obvious way to guide indexers toward certain formulas for describing items is to have them respond to questionnaires. A CIFT indexing worksheet in fact comes close to a questionnaire. The Modern Language Association of America currently uses four worksheets for CIFT indexing: one for National Literatures, one for Language and Linguistics, one for General Literature, and one for Folklore. On each worksheet, all appropriate "facet" codes and some "role" codes are preprinted.
Figure 1 shows the MLA National Literatures worksheet.
[Figure 1. MLA National Literatures worksheet: recto and verso.]
"Facet" codes include:
yl/
ya/
ul/
ma/
"Role" codes are:
<tof "treatment of"
<ion "influence on"
<soi "sources in"
<apo "application of"
An earlier version of a CIFT indexing worksheet also includes footnotes indicating which questions must be answered and under which circumstances; e.g., "1. Specific persons in RA facet require period (TA) and place (UA)." "Questions" which always require answers are marked with an "(a)".
Generally the worksheet or questionnaire approach presumes that the indexer is recording data on paper and that someone else is later actually inputting the data at a keyboard; electronic worksheets or questionnaires are also possible, however. PERMDEX supplies a fairly simple example: the software first prompts the indexer for a term and then asks what role code is to be attached and whether the term is an access term or not. Electronic worksheets are already widely used for library catalog systems. Some experimentation on electronic worksheets has also been carried out for NETPAD (Craven 1983b).
Of course the index-string display approach will not work for input strings that are so badly constructed that they cannot be processed. Also, examining the index strings may consume considerable time if they are long and varied and there are many of them. Furthermore, if something is wrong, the indexer still needs to determine what to change in the input string.
Sometimes, an indexer will create an input string which seems likely to contain an error, but which might also be correct. For example, a term beginning with a character other than a letter of the alphabet might usually be expected to be a mistake. But it may not be a mistake: the nonalphabetic character may be intended to place an index entry in a classified sequence. Thus, when string indexing software encounters a likely input error it may warn the indexer, but should not reject the input.
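The "warn, but do not reject" policy can be illustrated with a minimal Python sketch. The function name and the sample term beginning with a class number are hypothetical; the only check shown is the one mentioned above, a term whose first character is not a letter.

```python
def check_terms(terms):
    """Return warnings for suspicious terms; never reject the input."""
    warnings = []
    for term in terms:
        if term and not term[0].isalpha():
            warnings.append(
                f"term {term!r} begins with a non-alphabetic character; "
                "please confirm that this is intended"
            )
    return warnings


# Hypothetical input: the second term starts with a class number on purpose.
terms = ["Home nursing", "649.8 Child care"]
for message in check_terms(terms):
    print("WARNING:", message)
# The list of terms is left unchanged, so the indexer can keep it as typed.
```

Because the function only returns messages, the decision to change or keep the term stays with the indexer.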
If an error is detected or suspected, the indexer needs to get some sort of message so as to be able to correct or check the input. The simplest, and least helpful, type of message is one saying simply that an input string has been rejected; the Relational Indexing software gives this type of message. More helpful are messages giving the type of error or pointing to the error's location. PERMDEX and UTLAS PRECIS supply examples of messages giving the type of error; NEPHIS and CASIN supply examples of messages giving both type and location.
The UTLAS version of the PRECIS software checks for 36 different error conditions in an input string. Thirty of these error conditions lead to "customer messages", which are sent to the person or organization responsible for the input string. The remainder lead to "diagnostic messages", which are kept for internal use. Customer messages cover such errors as invalid codes, incorrect combinations of codes, and the absence of required elements. Examples of the last are an input string containing no terms at all or one not starting with a "0", "1", or "2" role code in column 3. Diagnostic messages arise when an input string requires more space in the computer's memory than was allowed by the programmer (Cain 1984).
The PERMDEX software not only checks for coding errors but also prompts the indexer for corrections. Specifically, if the input string does not contain at least one term with one of the three most important role codes, the indexer is prompted to supply the missing term. These three most important role codes are "0", "1", and "2". The PERMDEX test, however, is not equivalent to the one in UTLAS PRECIS mentioned above: PERMDEX does not require role code "0", "1", or "2" to be the first in the input string; moreover, PERMDEX's "1" is, in fact, closer in meaning to PRECIS' "3" than to PRECIS' "1".
NEPHIS software generally detects two major types of coding error: unmatched brackets and improperly terminated connectives. Where the error is detected is indicated by showing where generation of the first index string had to be abandoned. For example, given the incorrect input string
RESCUES? OF CHILDREN>? BY <DOGS>
the Commodore BASIC version of the NEPHIS index string generator responds
***OUCH!***
THERE IS A '>' WITHOUT A MATCHING '<'
THE FIRST PERMUTATION GOT AS FAR AS
RESCUES OF CHILDREN
Because the NEPHIS index string generation rules are fairly simple, an experienced NEPHIS indexer should often be able to find the error in the input fairly quickly from this kind of response.
Some CASIN error messages, on the other hand, are very exact about the location of errors in an input string, though the result can be more error messages than there are underlying errors. For example, in the message
10*3-10- d0272#
21*demand for //bananas#
32*Australia
"STRUCTURAL DEFECT: POSITION 1 NO DIGIT"
"STRUCTURAL DEFECT: POSITION 2 NO DIGIT"
"STRUCTURAL DEFECT: POSITION 3 NO SPECIAL CHAR."
51*Economics#
71*21 Bananas#
81*21#
the positions of three invalid characters are noted, while the underlying error is a single one of omitting the second "category" code in a line which should read
51*21 Economics#
While coding errors are the errors for which string indexing software most often checks, errors in terms may also be caught. Thus, the CASIN software will reject any input string specifying an access term which is not in its master list of permitted headings (Schneider 1976, pp. 162-163).
Some errors in input terms can be corrected automatically by using a thesaurus or authority file. The software can look up the input terms to see whether they have equivalent preferred terms and if so substitute the preferred terms automatically. Even if the exact terms cannot be looked up, the software might try to match an input term to a similar term in the thesaurus according to some set of rules. CIFT takes this approach (Modern Language Association 1982, pp. 12-13). CIFT's general rule for matching terms is to transform both to lower case, strip them of accents, and treat hyphens as spaces; some additional rules are used on titles and author names. More sophisticated matching is theoretically possible. For example, automatic spelling correction procedures could be applied to string indexing input.
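The general matching rule attributed to CIFT above (lower case, no accents, hyphens treated as spaces) can be sketched in Python as follows. This is not CIFT's code: the function names and sample terms are hypothetical, the additional rules for titles and author names are not covered, and collapsing runs of spaces is an extra assumption of the sketch.

```python
import unicodedata


def normalize_term(term):
    """Lower-case a term, strip its accents, and treat hyphens as spaces."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    without_accents = "".join(
        char for char in decomposed if not unicodedata.combining(char)
    )
    # Assumption of this sketch: runs of spaces are collapsed to one space.
    return " ".join(without_accents.replace("-", " ").split())


def match_term(input_term, thesaurus_terms):
    """Return the first thesaurus term equal to input_term after normalization."""
    wanted = normalize_term(input_term)
    for candidate in thesaurus_terms:
        if normalize_term(candidate) == wanted:
            return candidate
    return None


# Hypothetical thesaurus entries and input term.
print(match_term("folk-tale", ["Ballad", "Folk tale"]))
```

A term that finds no match under these rules would still need the indexer's attention, or one of the more sophisticated procedures mentioned above.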
Some errors in coding can also be corrected automatically. Indeed, interactive error-correcting software may even appear to prevent the indexer from creating incorrect input strings in the first place. NEPHIS coding rules are simple enough that it has been possible to devise an online screen editing program which does this for a number of types of coding error (Craven 1983a).
In general, the NEPHIS screen editing program takes two approaches against errors. First, under certain conditions, if the indexer presses a key for a character whose addition to the input string would result in an error, then the program simply ignores it; when this happens, the key appears "dead" to the indexer. Second, a character which by itself would create an error may be inserted in the input string in reverse field (black on white instead of white on black); it then remains in reverse field until one or more other characters which make it correct are inserted. Any reverse-field characters remaining in an input string are recognized by the editing program as having no value as NEPHIS coding symbols; they will be discarded whenever the input string is stored for later processing by the NEPHIS program.
For example, suppose the indexer is editing the phrase
COAL MINES IN CANADA
and tries to insert a left bracket before "MINES" (to provide access under this term). The result will be
COAL <MINES IN CANADA
That is, the left bracket will be in reverse field because a matching right bracket is still needed to avoid an error. As soon as the indexer inserts the missing right bracket, the reverse-field left bracket is changed to an ordinary left bracket; for example,
COAL <MINES> IN CANADA
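The reverse-field behaviour for brackets can be approximated by the following Python sketch. It is not the NEPHIS screen editing program: the function names are hypothetical, only the "<" and ">" symbols are handled, and an unmatched ">" is treated the same way as an unmatched "<", which is an assumption (the real editor may instead make such a key appear "dead").

```python
def pending_bracket_positions(text):
    """Positions of brackets still lacking a partner (shown in reverse field)."""
    open_positions = []   # stack of '<' positions waiting for a '>'
    pending = set()
    for position, char in enumerate(text):
        if char == "<":
            open_positions.append(position)
        elif char == ">":
            if open_positions:
                open_positions.pop()
            else:
                pending.add(position)
    pending.update(open_positions)
    return pending


def string_to_store(text):
    """Drop pending brackets, as reverse-field characters are discarded on storing."""
    pending = pending_bracket_positions(text)
    return "".join(char for i, char in enumerate(text) if i not in pending)


print(pending_bracket_positions("COAL <MINES IN CANADA"))
print(string_to_store("COAL <MINES IN CANADA"))
print(string_to_store("COAL <MINES> IN CANADA"))
```

In the first call the left bracket is reported as pending; once the matching right bracket is present, nothing is pending and the string is stored unchanged.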
An example of automatic coding suggestion is the automatic flagging-of-headings option in the OLPI program (Baser and others 1978), an online version of ASI. Using OLPI, the indexer, after first typing in a descriptive phrase, normally indicates by number terms in the phrase which are to be access point terms. In the automatic flagging option, however, the program suggests the access points and the indexer can accept or reject the suggestions.
More powerful aid is provided by the NEPHIS automatic coder (Craven 1982a). Using two short lists, a stoplist and a list of connectives, this program crudely analyzes title-like phrases and adds the NEPHIS coding symbols "<", ">", and "@". For example, given the title
PROMOTION OF INFORMATION SERVICES: AN EVALUATION OF ALTERNATIVE APPROACHES
the NEPHIS automatic coder will produce the input string
PROMOTION OF <INFORMATION <SERVICES>>: <@AN <EVALUATION> OF <ALTERNATIVE <APPROACHES>>>
NEPHIS benefits here from its heavy reliance on the structure of ordinary language. Automatic coding for PRECIS has been suggested, but does not seem feasible because PRECIS is much more complex and has many more coding requirements. Even the NEPHIS automatic coder is somewhat limited: it does not add the "?" symbol and it produces unsatisfactory results for certain phrases, especially for some containing coordinating conjunctions, titles or corporate names, or non-English words.
sleep /; researchers /; communication [informal]
/: /;
research /+ productivity
An indexer can use a hand-made graphic display like this as a source in making up the complete input string for computer processing:
v=1;s=sleep
v=1;s=researchers
v=1;s=communication [informal]
v=1;s=productivity
v=1;s=research
l=1;w=1;r=7;p=on;w=2
w=2;r=7;l=1;w=3
w=3;p=related to;r=7;w=4
w=5;r=5;p=of;w=4
g=1;w=2;r=9;p=by;l=1;w=5
The Relational Indexing indexer must carry out the translation from diagram to input string by hand and must in the process add information not given in the diagram.
Working in the opposite direction, the NEPHTREE program (Craven 1980a) takes a NEPHIS input string, or an indexer's attempt at a NEPHIS input string, and automatically displays the string's tree structure.
A still better aid is continuous graphic display while the indexer is creating input. The Commodore BASIC version of NETPAD provides an example: a special online screen editor which automatically, as the indexer types in the correct commands, both draws the two-dimensional display and stores the data needed to build the string. For instance, once an indexer has used the NETPAD editor to create the display
COMMUNICATION (INFORMAL)
|/^
| PRODUCTIVITY
| /(
| RESEARCH
/[ /[
------RESEARCHERS
/^
SLEEP
simply pressing one key will store the corresponding NETPAD input string.
Double-KWIC provides an illustration of how string indexing software may supply aid to an index producer for generating ancillary input. The Double-KWIC software could recognize many possible headings for index strings in a single input string, far more than would be useful for a single index display. For example, the possible headings from the title
DASAR: COMPUTER-BASED DATA STORAGE AND DATA RETRIEVAL
would be
COMPUTER
COMPUTER BASED
COMPUTER BASED DATA
DASAR
DATA
DATA STORAGE
RETRIEVAL
STORAGE
STORAGE AND DATA
To restrict the possibilities before the index is produced, the software presents the index producer with an alphabetized list of all such possible headings and their frequencies. The golist of allowed headings is then compiled by the index producer's selecting headings from the list by sequence number.
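The listing-and-selection step can be sketched in Python as below. This is not the Double-KWIC software: the function names are hypothetical, the rules by which possible headings are recognized in a title are not reproduced (the headings are simply taken as given), and counting a heading once per title is an assumption about how the frequencies are computed.

```python
from collections import Counter


def candidate_listing(headings_per_title):
    """Alphabetized (sequence number, heading, frequency) rows."""
    # Assumption: a heading counts once for each title in which it is possible.
    counts = Counter(
        heading for headings in headings_per_title for heading in set(headings)
    )
    return [
        (number, heading, counts[heading])
        for number, heading in enumerate(sorted(counts), start=1)
    ]


def compile_golist(listing, chosen_numbers):
    """Headings picked by sequence number, in listing order."""
    chosen = set(chosen_numbers)
    return [heading for number, heading, _ in listing if number in chosen]


# Possible headings for the DASAR title, as listed in the text.
dasar = ["COMPUTER", "COMPUTER BASED", "COMPUTER BASED DATA", "DASAR", "DATA",
         "DATA STORAGE", "RETRIEVAL", "STORAGE", "STORAGE AND DATA"]
listing = candidate_listing([dasar])
for number, heading, frequency in listing:
    print(number, heading, frequency)
print(compile_golist(listing, [4, 5, 7]))
```

With more titles in the input, the frequency column would help the index producer decide which headings are worth keeping in the golist.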
Documentation in the form of manuals provides one kind of guidance. Many string indexing systems also rely to some extent on the biases of ordinary language. Faceted approaches lay out more rigid structures into which the indexer inserts appropriate terms.
Past decisions may be collected into an authority list, which often is, or is combined with, a thesaurus; examples are the PRECIS authority list and POPSI's classaurus.
Forms or worksheets are another useful aid, especially when much coding is required, as in PRECIS, or when a faceted approach is taken, as in CIFT. Electronic forms or worksheets are becoming more common.
Software may assist indexers in a number of ways. The index strings resulting from a given input string may be displayed. Errors, especially in coding, may be detected and their type and position indicated. Some errors in terms or coding may be corrected automatically. Suitable coding may be suggested. Either the software or the indexer may produce graphical displays of input string structures.
Apart from the indexer, other people involved in producing an index display may also benefit from assistance, as the index producer does in Double-KWIC.