CHAPTER 5
THE SYNTAX OF INDEX STRINGS

The syntax of an index string includes both the order in which the terms appear in it and the connectives used to indicate links between the terms. The first aspect, which is referred to as citation order, is the major topic of this chapter, while the final section is devoted to the role of connectives.

5.1 PRINCIPLES OF CITATION ORDER

In string indexing, the overriding rule of citation order is that each of the various access terms in the input string is in turn the lead term of one or more index strings. This rule, however, still leaves undecided the order of the terms after the lead term. The possibility of accepting every possible citation order of these terms tends to be excluded by the number of index strings that results, unless the number of terms per index string is greatly restricted. Other rules of citation order must thus also be invoked if the index string contains more than two or three terms.
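
The growth that tends to exclude accepting every possible citation order can be made concrete with a small count. The sketch below assumes, purely for illustration, that every term is an access term and that each index string cites all the terms; the function name is hypothetical.

```python
from itertools import permutations


def all_citation_orders(terms):
    """Return every ordering of the terms, grouped by lead term."""
    by_lead = {}
    for order in permutations(terms):
        by_lead.setdefault(order[0], []).append(order)
    return by_lead


terms = ["RESCUES", "CHILDREN", "DOGS"]
orders = all_citation_orders(terms)
total = sum(len(group) for group in orders.values())
print(total)  # 6
```

With n terms the total is n factorial: three terms give 6 index strings, but five terms already give 120.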

Most existing theory of citation order is oriented toward systems in which only one citation order is required for any retrievable item; principally, toward traditional library classification systems. Thus, it does not take account of the need, in string indexing, for the order of terms to vary systematically depending on which is the lead term.

Existing theory can, nevertheless, be applied to discussion of string indexing citation order. This application is especially justified by the way in which most string indexing systems look at the citation orders of index strings: they emphasize the citation order of the input string, treating it like a single-entry citation order, and view the citation orders of the index strings as produced by manipulation of this basic order.

What are referred to as the principles of citation order are actually general rules which ought to promote good qualities in indexes or other retrieval tools. In string indexing, these rules should be justified at a more basic level by their contributions to good qualities of index strings; that is, for the most part, to predictability, collocation, clarity, and eliminability. Ultimately, of course, their application should create a good string index, one which searchers search more efficiently and effectively.

A number of principles of citation order have been suggested. Consistent application of only one principle yields better predictability for experienced searchers provided the searchers are able to grasp and apply the principle. On the other hand, more than one principle is usually required to determine the order of terms in a more complex description. Moreover, different principles tend to promote different good qualities in index strings, qualities whose importance varies from one place in the index string to another and from one place in the index to another. The principles do, however, often agree, and some can actually be justified on the basis of others.

The most developed theories of single-entry citation order are probably those of Ranganathan (Ranganathan 1964, 1967) and of Coates (Coates 1960, pp. 50-58). Ranganathan suggests that there may be an "absolute syntax" for subject descriptions: a sequence in which the ideas "arrange themselves in the minds of a majority of persons" regardless of which ordinary human language they speak. He is responsible for originating a number of principles, including the picturesquely named "wall-picture" and "cow-calf" principles. Coates' most important method is to develop a set of linktypes and determine for each linktype which of the two linked terms should be cited first. The set of linktypes developed by Coates was employed in determining the citation order in index entries and cross-references in the chain indexes of British Technology Index.

5.1.1 Context dependency

The most prominent citation order principle for string indexing systems is probably PRECIS' "context-dependency" principle. Before looking at this principle as a whole, it will be helpful to consider two separate principles of context and dependency. The first of these governs what terms are placed together in a citation order; the second, which is placed first.

The principle of context can be thought of as stating that a term should be adjoined by those other terms which serve most to narrow its scope or to qualify it; in other words, which provide the most context for understanding it. For example, in describing an item on "policies of the Australian government on foreign trade with developing countries", the term "GOVERNMENT" should be adjoined by "AUSTRALIA". Thus, in

AUSTRALIA.  GOVERNMENT. POLICIES. FOREIGN TRADE. DEVELOPING COUNTRIES.
the government is readily understood to be that of Australia and the policies those of the Australian Government. By contrast, changing the order to
AUSTRALIA.  FOREIGN TRADE. DEVELOPING COUNTRIES. POLICIES. GOVERNMENT.
might suggest that the policies belong to the governments of the developing countries.

Application of the context principle should contribute to overall clarity. It should also increase eliminability, by providing searchers with more immediate information on whether a term refers to things of interest to them.

A possible corollary following from the context principle is that pairs of terms which have the "strongest" links between them in a description should be adjacent in the citation order. A related idea, suggested by Austin (Austin 1982), is that searchers normally interpret adjacent terms as being linked in the strongest way possible. Austin categorizes the strength of linktypes with the formula:

Be > Have > Do > Locate
According to this formula, "Teachers. Women" is more likely to be interpreted as "Teachers who are women" (to be) than "Teachers who teach women" (to do); and "Teaching. Costs", as "Costs of teaching" (to have) than "Teaching the subject of costs" (to do).
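
Austin's formula can be read as a simple ranking rule. The sketch below is only an illustration: the numeric weights are an assumption (only their relative order comes from the formula), and the function name is hypothetical.

```python
# Austin's ranking of link strength: Be > Have > Do > Locate.
# The numbers are arbitrary; only their order matters.
LINK_STRENGTH = {"be": 4, "have": 3, "do": 2, "locate": 1}


def likeliest_reading(readings):
    """Pick the (link_type, paraphrase) pair with the strongest link type."""
    return max(readings, key=lambda reading: LINK_STRENGTH[reading[0]])


teachers_women = [
    ("do", "Teachers who teach women"),
    ("be", "Teachers who are women"),
]
teaching_costs = [
    ("have", "Costs of teaching"),
    ("do", "Teaching the subject of costs"),
]
print(likeliest_reading(teachers_women)[1])
print(likeliest_reading(teaching_costs)[1])
```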

The basic idea of the dependency principle is that the more "dependent" of two linked terms should normally be cited after the less dependent. Dependency can be interpreted in various ways; for example, the dependent term may represent an abstract property or something which has to follow in time. Another way of stating the principle might be to say that the dependent term is relatively more in need of qualification by the other term than vice versa. For example, the dependency principle would require the order

AUSTRALIA. GOVERNMENT
rather than
GOVERNMENT. AUSTRALIA
The scope of the term "AUSTRALIA" is scarcely restricted by the inclusion of the term "GOVERNMENT", because, in any case, "AUSTRALIA" still refers to the same geopolitical entity. By contrast, the term "GOVERNMENT", which alone could refer to any and all governments, is restricted, when coupled with the term "AUSTRALIA", to referring to the much narrower class of Australian governments.

Application of the dependency principle can be seen as contributing especially to eliminability. The more the dependency principle is applied, the less the meaning of a term in an index string is qualified by any of the terms that follow it; hence, the easier it is for searchers to break off reading an index string at any term. For example, the term sequence

FRANCE. NORTHERN REGIONS
follows the dependency principle, and searchers can break off at either term. By contrast, in the sequence
NORTHERN REGIONS. FRANCE
a searcher is unlikely to break off at "NORTHERN REGIONS", because this term means little except as qualified by some term indicating a whole.

Forms of dependency principle are widely applied in single-entry systems. A number of Ranganathan's principles provide examples, as do most of the linktypes in Coates' system.

PRECIS' "context-dependency" principle may be seen as a combination of context and dependency. When this principle is followed in a PRECIS input string, each term is qualified, or set in context, by the term which immediately precedes it. Each term is hence dependent, directly or indirectly, on all the terms which precede it. For example, an item described as "conference proceedings on prospecting for mineral deposits in glaciated regions" has a PRECIS input string

$z01030$aglaciated regions
$z11030$amineral deposits
$z21030$aprospecting
$z60030$aconference proceedings
Evidently, "mineral deposits" is qualified by "glaciated regions", "prospecting" by "mineral deposits", and so on. If context dependency is applied consistently, all the terms used are linked in a linear fashion, and no term is qualified directly by more than one other term.
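
The linear pattern of qualification in such an input string can be listed mechanically. The following is a minimal sketch, not PRECIS software: it assumes that the role code is the third character of each coded line and that the term is the text following "$a" or "$d", and both function names are hypothetical.

```python
import re


def parse_segment(line):
    """Split one coded line into (role_code, term).

    Assumptions: the role code is the third character of the line, and the
    term is the text after the first "$a" or "$d" up to the next "$".
    """
    match = re.search(r"\$[ad]([^$]+)", line)
    if match is None:
        raise ValueError(f"no term found in {line!r}")
    return line[2], match.group(1)


def qualification_pairs(lines):
    """Pair each term with the term immediately preceding it."""
    terms = [parse_segment(line)[1] for line in lines]
    return list(zip(terms[1:], terms[:-1]))


input_string = [
    "$z01030$aglaciated regions",
    "$z11030$amineral deposits",
    "$z21030$aprospecting",
    "$z60030$aconference proceedings",
]
for term, qualifier in qualification_pairs(input_string):
    print(f"{term!r} is qualified by {qualifier!r}")
```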

Qualification, it should be noted, is something of a function of how the indexer sees the topic of the indexed item. For instance, if an article mentions both glaciated regions and mineral deposits in these regions, the indexer might decide that the glaciated regions were the main topical focus. Thus, the article might be described as being on "glaciated regions having mineral deposits". In this case, the qualification pattern of the example above is reversed, and it is "glaciated regions" that is the qualified term. PRECIS, like many string indexing systems, would treat the two cases quite differently.

5.1.2 Other principles of citation order

The first-coming-to-mind principle, attributable to Coates, suggests that the term first thought of by most searchers should be cited first. Application of this principle should improve predictability for searchers who think like the majority. For index strings to be predictable to other searchers, these searchers will have to imagine correctly how the majority thinks, as will indexers.

Natural-language citation order means that the terms should appear in the same order in which they would be given in a description in ordinary language. Natural-language order may rank fairly high on clarity because of its familiarity to searchers. But it may perform less well on other desirable qualities of index strings. Predictability, for example, may suffer because different ways of describing the same thing in ordinary language may cite the same terms in different orders. Nor is natural-language order a guarantee of clarity, especially if the index strings depart in other ways from ordinary language. For example, in ordinary English, on the one hand, "The dogs rescued the children" clearly means one thing and "The children rescued the dogs" another; in an index string without distinguishing connectives, on the other hand, no order of the three terms "RESCUES", "CHILDREN", and "DOGS" will make clear to inexperienced searchers who is rescuing and who is being rescued.

A variation of the natural-language order principle suggests that citation order should correspond to the normal order of English passive sentences; that is, "object - action - agent" (Austin 1976b). Restriction to passive order may give greater predictability and clarity for experienced searchers, but inexperienced searchers may still encounter difficulties without additional assistance.

The principle of "decreasing concreteness", derived from Ranganathan, states that less concrete, more abstract, terms should be cited after more concrete terms. It may be justified by either the dependency principle or the first-coming-to-mind principle, or both.

A principle of giving precedence to more informative terms can be traced back at least as far as Cutter's work on subject headings (Cutter 1904, p. 72). Applying this principle should promote eliminability: putting more information about the indexed item early in the index string allows searchers to decide to discontinue reading the index string at an earlier point if it is not relevant. Terms may be more informative because they need less qualification; hence applying the dependency principle might lead to putting more informative terms first. Rare terms may also be more informative; hence a rare-term-first order may be adopted. Rare-term-first appears from the work of Harris (Harris 1970, p. 222) to be an underlying principle for citing certain nouns before their modifying adjectives in the Library of Congress Subject Headings.

Alphabetical order has been invoked from time to time, basically on grounds of efficiency for the indexer and predictability for the searcher. Notable examples in string indexing are SLIC and ABC/SPINDEX.

Hutchins (Hutchins 1974) suggests two further principles. Both principles can be viewed as giving precedence to terms depending on the types of links connecting them to other terms in the description. Hutchins' "semantic neutrality" principle gives precedence the more the linktype is semantically neutral; i.e., the more it is applicable to different types of term. Hutchins' "obligatoriness" principle, on the other hand, gives precedence the more the linktype is obligatory; i.e., the more universally it occurs in all descriptions in the database. Hutchins considers that these two principles may explain why objects of actions tend to be cited first. The "semantic neutrality" principle applies because any type of thing may be the object of an action; the "obligatoriness" principle, because almost all actions have objects. In contrast, the agent of an action may be viewed as having to be animate, and many actions do not have distinct agents.

Hutchins' obligatoriness principle can be seen as increasing clarity for experienced searchers. For example, suppose that an action must always have an object and the object term always immediately follows the action term. Searchers will then become used to recognizing the term immediately following any action term as representing the object, without a need for connectives or the like.

5.2 APPLICATION OF CITATION ORDER PRINCIPLES TO STRING INDEXING

5.2.1 Citation order with simple index string generators

Use of an index string generator which produces index strings by simple kinds of manipulation of the input string has advantages and disadvantages. On the one hand, such an index string generator is likely to be easy to program and cheap to run and to minimize the need for input coding. On the other hand, the resulting citation order is not always optimal. Even when the citation order of the input string is founded on suitable principles, production of multiple index strings necessarily disturbs the order of terms; as a result, an index string may violate a principle to which the corresponding input string adheres.

More fundamentally, each simple manipulation procedure has its own strengths and weaknesses in terms of predictability, collocation, clarity, and eliminability. Thus, KWOC-like procedures may produce good results with respect to the first three criteria, provided the input strings are well constructed. But they tend to create eliminability problems because searchers are uncertain as to the links between the lead term and the terms which follow it. By contrast, in shunting or cycling, the lead term tends to be followed immediately by terms which are closely linked to it.

An extended example showing the weakness of KWOC-like procedures in comparison to shunting may prove of value here. In May of 1979, the PRECIS authority list (British Library Automated Information Service 1979) had over 100 input strings accessible under the term "Acquisition". As has been suggested independently more than once (Hunt 1977b, p. 151), a KWOC-like procedure can be applied to these input strings. Each PRECIS input string is used to produce a single unvaried description in which access terms are marked with angular brackets. For example, the input string

$z00030$dGreat Britain
$z11030$aagricultural land
$z21030$aacquisition$v&
$zg1030$aoccupancy
$z60030$areports, surveys
is used to produce the single description
Great Britain. <Agricultural land>.
<Acquisition> & <occupancy> - Reports, surveys
The set of index strings produced from this description by the KWOC-like procedure is then:
  1. Acquisition.  Great Britain. Agricultural land. Acquisition & occupancy - Reports, surveys
  2. Agricultural land. Great Britain. Agricultural land. Acquisition & occupancy - Reports, surveys
  3. Occupancy. Great Britain. Agricultural land. Acquisition & occupancy - Reports, surveys
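
The KWOC-like procedure just illustrated is simple enough to sketch in a few lines. The code below is an illustrative reconstruction from the example only, not the procedure of any published system: it assumes that access terms are marked with angle brackets, that the lead term is capitalized, and that the description is repeated unvaried after it. The function name is hypothetical.

```python
import re


def kwoc_like(description):
    """Build one index string per access term marked with <...>."""
    access_terms = re.findall(r"<([^<>]+)>", description)
    # The description itself is repeated unvaried, minus the markers.
    plain = " ".join(description.replace("<", "").replace(">", "").split())
    return sorted(f"{term[0].upper()}{term[1:]}. {plain}" for term in access_terms)


description = (
    "Great Britain. <Agricultural land>. "
    "<Acquisition> & <occupancy> - Reports, surveys"
)
for index_string in kwoc_like(description):
    print(index_string)
```

Because nothing after the lead term ever varies, the relation between the lead term and the term that follows it is left to chance.
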
Looking under "Acquisition", searchers thus find:
Acquisition. Adolescents deprived of personal contact. Language skills. Acquisition - Case studies
Acquisition. Animals. Development. Concepts. Acquisition by Children
Acquisition. Animals. Skills. Acquisition
Acquisition. Art objects. Acquisition. Ethics
Acquisition. Babies. Language skills. Acquisition
Acquisition. Babies. Language skills. Acquisition. Use of holophrases
Acquisition. Babies, to 18 months. Language skills. Acquisition
Acquisition. Brazilian citizenship. Acquisition. Law
Acquisition. Canada. Children's stories. Publishing. Effects of acquisition of children's stories by children's libraries in Canada
Acquisition. Canberra. Universities. Libraries: Australian National University. Library. Stock: Documents on South East Asia. Acquisition - Proposals
Acquisition. Children. Bilingualism. Language skills. Acquisition
Acquisition. Children. Concepts: Chance & probabilities. Acquisition
Acquisition. Children. Language skills. Acquisition
and so on.

One can see from the example the delays a KWOC-like index display can cause searchers because of poor eliminability. The second term of an index string sometimes represents the object of acquisition and sometimes not. Thus, a searcher has to look to the third term of an index string like "Acquisition. Babies. Language skills. Acquisition" to guess that it is not the babies that are being acquired (by an adoption agency or whatever) but the language skills (by the babies). Sometimes, the connection of the second term with the lead term is so remote that a searcher has to look through most of the index string to see what it is. Thus, in "Acquisition. Animals. Development. Concepts. Acquisition by children", neither the second term nor the third term is linked to the first; the fourth term is, but a searcher may have to read the final phrase "Acquisition by children" to grasp how.

Relatively good eliminability, on the other hand, is produced mostly by shunting in the actual PRECIS index strings. Here, all the index strings starting with "Acquisition" continue immediately with terms indicating what is being acquired:

Acquisition. Agricultural land. Great Britain
     - Reports, surveys
Acquisition. Art objects
     Ethics
Acquisition. Books. Stock. Libraries
     Selection
Acquisition. Books. Stock. Libraries. Universities. United States
     Selection. Approval plans - Reports, surveys
Acquisition. Books. Stock. Public libraries. Great Britain
     From booksellers. Delay - Reports, surveys
Acquisition. Books. Stock. Public libraries. Great Britain
     Selection
Acquisition. Brazilian citizenship
     Law
Acquisition. British citizenship
     - Statistics
Acquisition. Canadian provincial government publications. Stock. Libraries
Acquisition. Children's books. Stock. Libraries. United States
     Selection - Readings
Acquisition. Children's stories. Stock. Children's libraries. Canada
     Effects on publishing of children's stories in Canada - Reports, surveys
Acquisition. Cognitive skills
     Simulations. Applications of digital computer systems. Programs: HACKER program
Acquisition. Companies. European Community countries
     - Practical information
and so on.

The input strings for the KWOC-like process in the example mostly follow the dependency principle; but, even if a different principle is adopted for fixing the order of terms, problems arise at some point when the descriptions are sufficiently complicated. For example, adopting an ordinary-language order with prepositional phrases, which basically inverts the dependency principle, indeed yields readily eliminable index strings such as

Acquisition. Acquisition of concepts of development of animals by children
But it also yields an index string like
Animals. Acquisition of concepts of development of animals by children
where the problem of remote connection recurs.

5.2.2 Citation order with sophisticated index string generators

More sophisticated string indexing systems attempt to overcome the limitations of the simple manipulation approach in various ways. NETPAD and Relational Indexing perhaps differ most sharply from simple manipulation procedures. In these two systems, the citation order of the input string is without significance: the citation order of the index strings is determined entirely by other factors, such as the links that have been defined between terms.

In systems like LIPHIS and PRECIS, the order of terms in the input string is significant, but coding is used to mitigate the possible undesirable effects of simple manipulation routines. Thus, PRECIS does not always follow the shunting method. For example, PRECIS index strings with lead terms associated with the main role code "4" ("viewpoints"), "5" ("sample population/study region"), or "6" ("target/form") show a KWOC-like arrangement. In such index strings, collocation tends to be improved, by putting more significant terms more immediately after the lead term; any sacrifice of the context principle is considered relatively mild.

5.2.3 The context-dependency compromise

The citation order most commonly favoured for index strings by ASI, PRECIS, NEPHIS, LIPHIS, NETPAD, and PERMDEX and for input strings and subheadings by POPSI represents a sort of compromise between the context and dependency principles rather than a combination of the two.

To understand the composing of a string according to the context-dependency compromise, consider first cases in which no unused terms remain to qualify any term already included earlier in the string. Here, the next term appended is simply a term qualified by the last term that did not qualify a previous term. In such cases, both the context and the dependency principles are maintained.  For example, the PRECIS index string

Glaciated regions
    Mineral deposits. Prospecting - Conference proceedings
adheres to both context and dependency because no term qualifies any of the terms that precede it.

When an unused qualifying term does remain, the context-dependency compromise requires that it be appended to the string before any non-qualifying terms. For instance, in another index string for the same item,

Mineral deposits. Glaciated regions
    Prospecting - Conference proceedings
"GLACIATED REGIONS" is interposed to qualify "MINERAL DEPOSITS". The second part of a PRECIS index string, the "qualifier", is specially designed to hold terms interposed in this way, though terms may also be interposed in the third part, the "display". In the latter case, shunting is not sufficient to produce the desired result, and other procedures, such as the "predicate transformation", must be invoked.

In determining the order in which two or more terms qualifying the same previous term are to be inserted, a ranking of the different types of term link, like that suggested by Austin, may be of use. Thus, in the index string

Pollination. Crops
    By bees
the citing of "BEES" after "CROPS" can be justified by assuming that an action-agent link is "weaker" than an action-patient link. Hutchins' "semantic neutrality" and "obligatoriness" principles can also be applied in deciding to qualify an action by its object before its agent.
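
For the simple linear case, the three-part layout discussed in this section (lead, qualifier, display) can be sketched as follows. This is a simplified illustration under stated assumptions, not the PRECIS index string generator: it handles only a single chain of terms, leaves out form terms such as "Conference proceedings", and does not model the predicate transformation. The function names are hypothetical.

```python
def shunt(terms, lead_index):
    """Split a linear chain of terms into lead, qualifier, and display."""
    lead = terms[lead_index]
    # Earlier terms set the lead in context, nearest qualifier first.
    qualifier = terms[:lead_index][::-1]
    # Later terms depend on the lead and keep their input order.
    display = terms[lead_index + 1:]
    return lead, qualifier, display


def layout(terms, lead_index):
    """Render the three parts as a two-line index string."""
    lead, qualifier, display = shunt(terms, lead_index)
    first_line = ". ".join([lead] + qualifier)
    if not display:
        return first_line
    return first_line + "\n    " + ". ".join(display)


chain = ["Glaciated regions", "Mineral deposits", "Prospecting"]
for i in range(len(chain)):
    print(layout(chain, i))
```

When "Mineral deposits" leads, "Glaciated regions" is interposed in the qualifier before the dependent term "Prospecting", as the compromise requires.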

5.3 CONTROL OF CITATION ORDER

Various agents may control citation order in a string indexing system. Chief among these agents is, of course, the index string generator; but rules for composing input strings also often play a considerable role, and, in a number of cases, citation order is controlled at least in part by the indexer. Occasionally, some control is exercised by the overall database of descriptions, by the producer of the index, or even by the searcher.

5.3.1 Control of citation order by rules

In some string indexing systems, the form of the input string is determined by explicit rules, and the index string generator uses only its own rules plus the input string in determining the forms of the index strings. The citation order of the index strings is thus highly predictable. These systems assume either that some universal principles can completely determine citation order for all searches in all indexes or that the specific searchers or collection for which indexing is done requires only one citation order. PRECIS tends to assume the first and CIFT and MULTITERM the second.

Although the form of PRECIS input strings is somewhat flexible for very complex subjects, there are many rules which must be followed by PRECIS indexers. For example, the order of the role codes in column 3 determines a general order in input strings of: environment, objects of action, action, agents of action, viewpoints, samples, audience and form. Take, for instance, an input string for "definitions of the terminology of computer science and information science":

$z21030$acomputer science$v&
$zg1030$ainformation science
$zp1030$aterminology
$z60030$adefinitions
The order
$z60030$adefinitions
$z21030$acomputer science$v&
$zg1030$ainformation science
$zp1030$aterminology
is incorrect because a segment tagged with the role code "6" ("target/form") in column 3 must not precede a segment tagged with role code "2" ("action/effect"). Similarly, the position of "terminology" is fixed by a rule which requires a part or property to follow the whole thing to which it belongs; and "information science" is required to follow "computer science" by a rule that parallel terms must be input in alphabetical order if no other logical order is available.
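
The rule on the general order of role codes lends itself to a mechanical check. The sketch below is a deliberate simplification: it assumes that the numbered role codes must simply never decrease down the input string and that letter codes such as "g" and "p" can be skipped, which ignores the flexibility PRECIS allows for very complex subjects. The function name is hypothetical.

```python
def main_role_order_ok(role_codes):
    """Check that the numeric role codes never decrease down the string.

    Assumption: letter codes such as "g" and "p" are attached to the
    preceding numbered segment and are skipped by this check.
    """
    main = [int(code) for code in role_codes if code.isdigit()]
    return all(a <= b for a, b in zip(main, main[1:]))


print(main_role_order_ok(["2", "g", "p", "6"]))  # True
print(main_role_order_ok(["6", "2", "g", "p"]))  # False
```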

Because of PRECIS' explicit rules for input strings and because the PRECIS index string generator takes account only of its own rules plus the input string, the citation order for index strings is generally predetermined for all conditions. If, for instance, a PRECIS index string names an action, the object of the action, the agent, and the place, the order is always:

  1. "action - object - place - agent", if the action is the access point -
    Rescue. Children. Great Britain
         By dogs
  2. "object - place - action - agent", if the object is the access point -
    Children. Great Britain
         Rescue by dogs
  3. "agent - place - action - object", if the agent is the access point -
    Dogs. Great Britain
         Rescue of children
  4. "place - object - action - agent", if the place is the access point -
    Great Britain
         Children. Rescue by dogs
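
The four orders listed above amount to a lookup table keyed on the role of the lead term. The sketch below records only the order of terms; the connectives ("by", "of") and the two-line layout of the actual index strings are not reproduced, and the names used are hypothetical.

```python
# Citation order for each choice of access point, as listed above.
ORDER_BY_LEAD = {
    "action": ("action", "object", "place", "agent"),
    "object": ("object", "place", "action", "agent"),
    "agent": ("agent", "place", "action", "object"),
    "place": ("place", "object", "action", "agent"),
}


def cite(description, lead_role):
    """Return the terms of a role-to-term mapping in the order for this lead."""
    return [description[role] for role in ORDER_BY_LEAD[lead_role]]


rescue = {
    "action": "Rescue",
    "object": "Children",
    "agent": "Dogs",
    "place": "Great Britain",
}
for lead_role in ORDER_BY_LEAD:
    print(cite(rescue, lead_role))
```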

Regularity of citation order has advantages. The user of one PRECIS index will find that all other PRECIS indexes can be searched in very much the same way. Likewise, a PRECIS indexer approaching a new collection does not have to work out, or become familiar with, a new set of rules. Within a single index, searchers may become used to one kind of collocation of index strings and one kind of meaningful order of terms, and indexers find decision-making less worrisome.

Regularity of citation order also has disadvantages. Regularity can be seen as rigidity. Do searchers always want geographical location cited so close to the beginning of the index string, for example? Might some searchers find it not especially useful in deciding which entries represent what they want? Is the same citation order suited to all types of indexed items? Might some items be described more clearly, for example, if the standard citation order is varied?

Systems like MULTITERM and CIFT partly avoid the accusation of rigidity by assuming a narrow class of searchers and indexed items. MULTITERM, for example, is designed specifically for documents in chemistry and uses a formula quite specific to chemistry to determine citation order: "chemical being prepared -> reactant(s) -> process -> reaction conditions (catalyst; solvent, etc.) -> equipment -> use of chemical prepared -> property of chemical prepared" (Skolnik 1970).

5.3.2 Control of citation order by discipline

One way in which citation order can be adapted to different types of indexed items and to different searchers with different requirements is to vary it with the discipline. An entirely different set of rules for citation order need not, however, be devised for each discipline. General principles of a single string indexing system can exercise the main control over citation order while the discipline controls aspects important for the differences between disciplines. An example of an attempt at this kind of selective discipline control is provided by POPSI.

In terms of general principles, the POPSI citation order is a fixed one of discipline, entity, parts and properties, and processes. This formula is derived from Ranganathan's "Personality, Matter, Energy, Space, Time" (PMEST) formula for the construction of faceted classification schemes.

POPSI citation order also varies with the discipline, however. Take a document on the hunting of seals by Inuit. If the discipline is marine biology, the terms may be cited in the order:

MARINE BIOLOGY,SEALS:HUNTING-(BY)INUIT
If, on the other hand, the discipline is anthropology, the order may be:
ANTHROPOLOGY,INUIT;HUNTING-SEALS
The discipline determines which part of the description is treated as the entity, which part as the part/property, and which part as the process; in this way, it also determines what the exact citation order will be.

Normally, a POPSI input string or subheading contains a discipline term; but users of POPSI have the option to omit this term while still varying the citation order as though discipline terms were present. Thus, even when two descriptions have all the same terms and all the same links between the terms, POPSI allows more than one citation order following the lead term. For example, when the lead term is "HUNTING", two possible index strings for the document on "hunting of seals by Inuit" are:

HUNTING,SEALS
     SEALS:HUNTING-(BY)INUIT
and
HUNTING,INUIT
     INUIT;HUNTING-SEALS

5.3.3 Control of citation order by indexers

Giving the indexer some control of citation order can lead to inconsistency, but it may produce some better index strings where the rules are inadequate. The extreme of indexer-determined citation order is represented by TOPSI-UNIV, where terms can be included in different index strings in any combinations or permutations the indexer desires. String indexing systems which accept ordinary-language input strings allow indexers a good deal of freedom to control citation order simply by changing the order of terms in the input string. Rather than deal in detail with such fairly obvious indexer control, however, this section will describe two somewhat more subtle instances.

NEPHIS allows the indexer limited control over citation order even when the ordinary-language part of the input is the same. For example, the phrase

VARIATIONS IN THICKNESS OF LITHOSPHERE IN CANADA
can be coded as
@VARIATIONS? IN <@THICKNESS? OF <LITHOSPHERE? IN <CANADA>>>
with the index strings
  1. CANADA. LITHOSPHERE. THICKNESS. VARIATIONS
  2. LITHOSPHERE IN CANADA. THICKNESS. VARIATIONS
or as
@VARIATIONS? IN <@THICKNESS? OF <LITHOSPHERE>? IN <CANADA>>
with the index strings
  1. CANADA. THICKNESS OF LITHOSPHERE. VARIATIONS
  2. LITHOSPHERE. THICKNESS IN CANADA. VARIATIONS
or as
@VARIATIONS? IN <@THICKNESS? OF <LITHOSPHERE>>? IN <CANADA>
with the index strings
  1. CANADA. VARIATIONS IN THICKNESS OF LITHOSPHERE
  2. LITHOSPHERE. THICKNESS. VARIATIONS IN CANADA
Which coding is chosen depends on the term to which the indexer decides "CANADA" ought to be linked. That the indexer's control is limited as long as the ordinary-language part remains the same is indicated by the fact that no coding of the phrase can produce an index string in which the terms appear in the order
CANADA   THICKNESS   VARIATIONS   LITHOSPHERE
or
LITHOSPHERE  VARIATIONS   CANADA   THICKNESS
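
The behaviour shown by the three codings can be imitated with a short recursive procedure. What follows is a reconstruction inferred solely from the six index strings above and should not be taken as the NEPHIS algorithm itself. It assumes that angle brackets nest phrases, that "@" at the start of a phrase suppresses it as an access point, and that "?" introduces a connective which is dropped together with the nested phrase when that phrase is lifted out. All function names are hypothetical.

```python
def parse(s, i=0):
    """Parse one phrase starting at s[i]; return (phrase, next_index)."""
    access = True
    if i < len(s) and s[i] == "@":
        access = False
        i += 1
    parts = []   # plain text pieces or (connective, child_phrase) pairs
    text = ""
    conn = None  # None when not reading a connective
    while i < len(s):
        ch = s[i]
        if ch == ">":
            i += 1
            break
        if ch == "?":
            if text:
                parts.append(text)
            text = ""
            conn = ""
        elif ch == "<":
            if text:
                parts.append(text)
                text = ""
            child, i = parse(s, i + 1)
            parts.append((conn or "", child))
            conn = None
            continue
        elif conn is not None:
            conn += ch
        else:
            text += ch
        i += 1
    if text:
        parts.append(text)
    return {"access": access, "parts": parts}, i


def render(phrase, skip=None):
    """Render a phrase in full, omitting child `skip` and its connective."""
    out = []
    for part in phrase["parts"]:
        if isinstance(part, str):
            out.append(part)
        elif part[1] is not skip:
            out.append(part[0] + render(part[1]))
    return " ".join("".join(out).split())


def index_strings(coded):
    """One index string per access phrase, working outward from it."""
    root, _ = parse(coded)
    results = []

    def walk(phrase, ancestors):
        if phrase["access"]:
            segments = [render(phrase)]
            child = phrase
            for ancestor in reversed(ancestors):
                segments.append(render(ancestor, skip=child))
                child = ancestor
            results.append(". ".join(seg for seg in segments if seg))
        for part in phrase["parts"]:
            if not isinstance(part, str):
                walk(part[1], ancestors + [phrase])

    walk(root, [])
    return sorted(results)


coding = "@VARIATIONS? IN <@THICKNESS? OF <LITHOSPHERE>? IN <CANADA>>"
for index_string in index_strings(coding):
    print(index_string)
```

Since each index string is built by working outward from the lead phrase through its enclosing phrases, the bracketing fixes which term "CANADA" is attached to; on this reconstruction, none of the three codings yields either of the two excluded orders.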

Indexers using Relational Indexing can control citation order to some extent without changing either terms or links in the input string. Indeed, such control is sometimes required: when an access point term is linked to more than one other term, the input string must indicate which link the index string generator is to follow first in producing the corresponding index string. { 110} In practice, the indexer's choice often seems quite clear. A fairly simple example where a real choice does exist, however, is provided by the topic "use of Journal Citation Index, derived from Science Citation Index, in a procedure for clustering". A Relational Indexing input string for this topic is:

v=2;s=clustering
v=1;s=procedure
v=1;s=Journal Citation Index
v=1;s=Science Citation Index
l=1;w=1;r=8;w=2
w=2;p=with;r=7;p=for;w=3
w=4;p=giving;r=9;p=derived from;l=1;w=3
This input string can be approximately represented by the network diagram:
*JOURNAL CITATION INDEX
 |                   |
 for                 derived from
 |                   |
 PROCEDURE           SCIENCE CITATION INDEX
 |
 of
 |
*CLUSTERING
What this network diagram does not indicate, but the input string does, is that the index string generator must follow the link to "Science Citation Index" first when "Journal Citation Index" is the lead term. This link is represented by the last line in the input string, and it is here, rather than next to the term "Science Citation Index" itself, that the access point is marked, by means of the "l=1" before the term number "w=3". Thus, the index strings are:
  1. Clustering
         procedure with Journal Citation Index derived from Science Citation Index
  2. Journal Citation Index
         derived from Science Citation Index for procedure of clustering
The option open to the indexer here is to instruct the index string generator to follow the link to "Procedure" first. The indexer taking this option inserts "l=1" before the "w=3" in the second-last line of the input string, changing it to
w=2;p=with;r=7;p=for;l=1;w=3
As a result, the index string generator produces the index string { 111}
Journal Citation Index
     for procedure of clustering. Derived from Science Citation Index
This index string is generated either instead of or in addition to index string 2 above, depending on whether the "l=1" code is omitted from the last line of the input string or not.

5.3.4 Control of citation order by the database

The ASI software is unusual in that it partly selects the citation order of more complex index strings to fit the other index strings with the same lead terms. The aim is to make the index more uniform by making more subheadings begin in the same way under a given heading and hence to improve collocation and eliminability. Take, for example, the generation of an index string with the lead term "GREAT BRITAIN" from the input string "DUST CONTROL IN COAL MINES IN GREAT BRITAIN". If there are more possible index strings beginning
GREAT BRITAIN
    COAL ...
than
GREAT BRITAIN
     DUST ...
then the ASI index string generator selects the index string
GREAT BRITAIN
    COAL MINES IN, DUST CONTROL IN
Otherwise, the index string selected is
GREAT BRITAIN
    DUST CONTROL IN COAL MINES IN
A disadvantage of this approach is that the citation order is not uniform from one part of the index to another. Searchers may thus find the indexes more difficult to use because the index strings are less predictable. For example, under a country name in one portion of the index, subdivision might indeed be by environment, such as "COAL MINES". Elsewhere under a country name, however, preference might be given to subdivision by process, such as "DUST CONTROL"; such inconsistency in citation order might even be found under the same country name.
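
The selection rule just described can be sketched as a simple count. The following Python fragment is an illustration under stated assumptions, not the ASI software: the function name choose_subheading is hypothetical, the "other" subheadings under GREAT BRITAIN are invented for the example, and matching is done on the first word of each subheading only.

```python
from collections import Counter


def choose_subheading(candidates, other_subheadings):
    """Pick the candidate whose first word opens the most other subheadings
    possible under the same lead term; on a tie the earlier candidate wins."""
    first_words = Counter(s.split()[0] for s in other_subheadings)
    return max(candidates, key=lambda c: first_words[c.split()[0]])


# The default form is listed first so that it is kept unless "COAL" is more frequent.
candidates = ["DUST CONTROL IN COAL MINES IN", "COAL MINES IN, DUST CONTROL IN"]

# Hypothetical subheadings already possible under GREAT BRITAIN.
others = [
    "COAL MINES IN, VENTILATION IN",
    "COAL EXPORTS FROM",
    "DUST EXPLOSIONS IN FLOUR MILLS IN",
]

print(choose_subheading(candidates, others))
```

Since the outcome depends on what else happens to stand under the heading, the same input string may yield different subheadings under different lead terms or at different times.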

The overall database of input strings can theoretically be used to produce { 112} a rare-term-first order. A primitive example is seen in a version of TABLEDEX in which terms are numeric concept codes instead of ordinary-language expressions (Ledley 1958). Here, the first part of each concept code gives its frequency in the database, and thus the citation order places the less frequent terms first. As the originator realized, however, this sort of scheme requires not only citation order but also concept codes to be changed with changes to the database. The problem with such changes is not that searchers will look up outdated concept codes; they are not expected to remember the codes from search to search. Instead, they must look up the ordinary-language equivalents in a code dictionary before each search in the index. The problem is thus a serious loss of search efficiency.
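
The dependence of a rare-term-first order on the state of the database can be shown with a small sketch. It does not reproduce the TABLEDEX concept codes: ordinary words stand in for the numeric codes, the function name rare_term_first and the sample items are hypothetical, and ties are broken alphabetically as an arbitrary assumption.

```python
from collections import Counter


def rare_term_first(terms, database):
    """Order terms by ascending frequency in the database (ties alphabetical)."""
    frequency = Counter(t for item in database for t in set(item))
    return sorted(terms, key=lambda t: (frequency[t], t))


# Hypothetical database of input strings, each a list of terms.
database = [
    ["STEEL", "CORROSION"],
    ["STEEL", "WELDING"],
    ["STEEL", "CORROSION", "COATINGS"],
]
item = ["STEEL", "CORROSION", "COATINGS"]

before = rare_term_first(item, database)

# Two new items make "COATINGS" as frequent as "STEEL".
grown = database + [["COATINGS", "PAINT"], ["COATINGS", "ZINC"]]
after = rare_term_first(item, grown)

print(before)
print(after)
```

The same item receives a different citation order once the database has grown, which is the instability noted above.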

5.3.5 Control of citation order by the index producer

Ancillary input, such as a stoplist, can also be used to determine the citation order of the terms following the lead term. In a system devised for the Cambridge Crystallographic Database (Allen 1980), for example, the input string is a molecular formula.  Each segment of the formula consists of a one- or two-letter atomic symbol plus, if more than one atom of the element is present, an atom count; e.g.,
C10 H18 As2 Cl3 Ge Mn O3
Separately from the input string, a stoplist of common atomic symbols is also input; e.g., C, H, N, O, S, P, Cl, Br, I. An index string consists of: an atomic symbol not on the stoplist, plus its atom count; then the atomic symbols, and atom counts, for all the other elements in the formula that are not covered by the stoplist; and finally the symbols, and counts, for the stoplist elements. Thus the index string generator, using the sample stoplist above, produces the following index strings from the sample formula:
  1. As2 Mn Ge C10 H18 Cl3 O3
  2. Ge Mn As2 C10 H18 Cl3 O3
  3. Mn Ge As2 C10 H18 Cl3 O3
The results generally approximate a rare-term-first order. Searchers are assumed to find the uncommon elements better for quickly distinguishing whether a molecule is of interest: the presence of an uncommon element carries more information than that of a common one.
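
A minimal Python sketch of this procedure follows. It is not the Cambridge software; the function names are hypothetical. One point is an assumption inferred only from the three sample index strings above: the remaining non-stoplist elements are taken in the reverse of their order in the formula.

```python
import re

STOPLIST = {"C", "H", "N", "O", "S", "P", "Cl", "Br", "I"}


def symbol(segment):
    """Return the one- or two-letter atomic symbol of a segment such as 'Cl3'."""
    return re.match(r"[A-Z][a-z]?", segment).group()


def formula_index_strings(formula, stoplist=STOPLIST):
    """One index string per element that is not on the stoplist."""
    segments = formula.split()
    common = [s for s in segments if symbol(s) in stoplist]
    uncommon = [s for s in segments if symbol(s) not in stoplist]
    strings = []
    for lead in uncommon:
        # Assumption: other uncommon elements follow in reverse formula order.
        others = [s for s in reversed(uncommon) if s != lead]
        strings.append(" ".join([lead] + others + common))
    return strings


for line in formula_index_strings("C10 H18 As2 Cl3 Ge Mn O3"):
    print(line)
```

Changing the stoplist changes both which elements become lead terms and where the remaining elements fall, so the index producer controls citation order without touching the input strings.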

A variant of KWOC (Thomas and Whitehall 1971) also allows some index-producer control of citation order.  A list of common headings is used to { 113} determine under which headings titles are to be inverted in order to bring a non-stopword to the front; under headings not on the list, uninverted titles are given. The sort of result to be expected can be illustrated for the title "THE VALUE OF FUNDAMENTAL SCIENCE". If "SCIENCE" is placed on the list of common headings while "FUNDAMENTAL" is not and "VALUE" is a stopword, the index strings for this title take the forms:

  1. FUNDAMENTAL
         THE VALUE OF FUNDAMENTAL SCIENCE
  2. SCIENCE
         FUNDAMENTAL SCIENCE / THE VALUE OF
The inverted titles theoretically provide better collocation and eliminability where most needed, while elsewhere the uninverted titles give better clarity.

5.3.6 Control of citation order by searchers

String indexing systems for producing online screen displays in response to searchers' commands can allow a searcher to change the citation order of an index string. In this respect, they are unlike string indexing systems designed for producing printed indexes to be used in the same form for a large number of searches.

NETPAD allows searchers to control citation order by means of weights assigned to the different linktypes. As an example, a NETPAD input string for an item on "preplating of high-strength steels with copper" can be displayed as:

#  Term
1  PREPLATING
2  HIGH-STRENGTH
3  STEELS
4  COPPER

#  Linktype  #
1  O         2
2  T         3
1  W         4
By giving linktype "O" a higher weight than linktype "W", a searcher can get the NETPAD software to display an index string like
PREPLATING of HIGH-STRENGTH STEELS with COPPER
or, by giving linktype "W" the higher weight, one like
PREPLATING with COPPER of HIGH-STRENGTH STEELS
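
The effect of the weights can be sketched as a traversal of the links in order of decreasing weight. The following Python fragment is an illustration only, not the NETPAD software: the function name display is hypothetical, the words chosen to display each linktype ("of", "with", and nothing for "T") are assumptions made to match the two displays above, and links are followed only from the first term number to the second.

```python
TERMS = {1: "PREPLATING", 2: "HIGH-STRENGTH", 3: "STEELS", 4: "COPPER"}
LINKS = [(1, "O", 2), (2, "T", 3), (1, "W", 4)]

# Assumed display words for each linktype; "T" is shown by simple adjacency.
WORDS = {"O": "of", "W": "with", "T": ""}


def display(lead, weights):
    """Build a display string, following heavier linktypes first at each term."""

    def visit(number):
        parts = [TERMS[number]]
        outgoing = [link for link in LINKS if link[0] == number]
        outgoing.sort(key=lambda link: -weights[link[1]])
        for _, linktype, target in outgoing:
            if WORDS[linktype]:
                parts.append(WORDS[linktype])
            parts.append(visit(target))
        return " ".join(parts)

    return visit(lead)


print(display(1, {"O": 2, "W": 1, "T": 0}))
print(display(1, {"O": 1, "W": 2, "T": 0}))
```

The input string is unchanged between the two calls; only the searcher's weights differ.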
{ 114}

5.4 CONNECTIVES

Apart from separating terms, the main function of connectives in index strings is to increase clarity or detail by indicating to searchers the relationships between the things referred to. In this way, connectives reduce effort and possible errors. Without clarifying connectives, searchers must infer the relationships by other means. For example, they may guess what is the most likely relationship given the terms used, or they may seek patterns from other index strings.

Other purposes may also be served by connectives. Thus, eliminability may be served by using a connective to mark a point suitable for breaking off reading an index string; punctuation marks perform well in this function. Connectives may also contribute to collocation. Moreover, from the point of view of display formatting, connectives may indicate suitable points for beginning subheadings.

5.4.1 Types of connectives

The main types of connectives are words, such as prepositions and conjunctions, and punctuation marks, such as commas and periods. In languages with case-endings, these case-endings could be viewed as connectives or parts of connectives following or even embedded within the terms that they connect. In practice, however, in a system such as PRECIS, the different cases of a word are treated as different terms.

Two related tendencies can be noted in the relative use of punctuation marks versus words as connectives. First, words tend to separate short segments or individual terms of the index string, while punctuation tends to separate long segments. Second, words tend to express links followed for qualification and punctuation to express links followed for other purposes. One source of these tendencies can be seen in ordinary language, which tends to distribute its use of connectives in a similar way. Punctuation marks, or their spoken equivalents in pauses, intonation, etc., more often separate longer, more independent segments of discourse; words such as prepositions divide smaller, less independent segments.

The tendency for larger segments to be separated by punctuation and smaller segments by words is expressed quite specifically in CIFT by its distinction of two levels of role code. "Facet" codes in input strings generally correspond to periods separating longer segments in the index strings; "role" codes apply to smaller segments or individual terms and are represented by words and phrases such as " - APPLICATION", " - AS INFLUENCED BY", "Treatment of", "compared to", "Role of", "by", "in".

Using prepositions to represent qualification links and punctuation to { 115} represent other links is best illustrated by NEPHIS. In a NEPHIS input string, the qualifying terms normally follow the terms that they qualify. The qualification linktypes are indicated by means of "forward-reading" connectives, preceded by question marks ("?") and followed by lefthand brackets ("<"); these connectives are usually prepositions. When the linked terms appear in the index string in the same order as in the input string, the index string generator inserts these connectives. By contrast, periods are normally inserted when the citation order of the index string reverses that of the input string. For instance, the input string

Confidentiality? of <Records? of <Circulation? in <Libraries? in <Ontario>>>>
places all qualifying terms after the corresponding qualified terms. The index string which preserves the same citation order shows only prepositions as connectives:
Confidentiality of Records of Circulation in Libraries in Ontario
On the other hand, the index string which completely reverses the citation order of the input string shows only periods as connectives:
Ontario. Libraries. Circulation. Records. Confidentiality
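
For a fully nested input string such as this one, the choice between prepositions and periods can be sketched as follows. The Python fragment handles only this simple chain-like subset of the NEPHIS notation, and the function names parse_chain and index_string are hypothetical: terms from the lead term onward keep their forward-reading connectives, while the terms that preceded the lead term in the input string follow in reverse order after periods.

```python
def parse_chain(coded):
    """Split a fully nested input string into (term, connective) pairs."""
    pairs = []
    for segment in coded.strip().rstrip(">").split("<"):
        term, _, connective = segment.partition("?")
        pairs.append((term.strip(), connective.strip()))
    return pairs


def index_string(pairs, lead):
    """Forward-reading part from `lead` on, then earlier terms reversed after periods."""
    forward = pairs[lead][0]
    for i in range(lead, len(pairs) - 1):
        forward += f" {pairs[i][1]} {pairs[i + 1][0]}"
    earlier = [term for term, _ in reversed(pairs[:lead])]
    return ". ".join([forward] + earlier)


coded = "Confidentiality? of <Records? of <Circulation? in <Libraries? in <Ontario>>>>"
pairs = parse_chain(coded)

for lead in range(len(pairs)):
    print(index_string(pairs, lead))
```

The first and last lines printed correspond to the two index strings shown above; the lines between them mix prepositions and periods according to how much of the input order is preserved.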

A major disadvantage of words, especially prepositions, as connectives is the unnecessarily subtle and often meaningless distinctions required by the idiom of ordinary language. Farradane and Gulutzan cite the different ways in which location is represented in a sentence like "I am staying on the top floor at a hotel in the town" (Farradane and Gulutzan 1977). Similarly, the different prepositions, with or without articles, used to express geographical location in French were one of the problems dealt with in the PRECIS/Translingual Project (Matter 1979). Indexers can cater to ordinary-language idiom by including the contextually appropriate connective words in the input strings; either whenever connective words are required, as in PRECIS, or when the words depart from standard values, as for Relational Indexing. Nevertheless, connective words appearing early in an index string will tend to detract from collocation if regarded in sorting and to confuse searchers if disregarded.

Punctuation marks are less restricted by ordinary-language idiom, but, by the same token, can suffer from lack of detail or clarity. Thus, PRECIS' use of periods before qualifying terms in the "qualifier" part of an index string largely avoids the sort of dilemma created by prepositions. Searchers not familiar with the PRECIS format may hesitate, however, at the different function of the period in the later, "display", part; namely, to stand before non-qualifying terms. POPSI employs, at least in subheadings, different { 116} punctuation marks for the two different functions: before qualifying terms, hyphens; before other terms, other symbols such as semicolons, colons, and periods. Oddly, string indexing systems have made little use of parentheses, though these are perhaps the punctuation most suggestive of the qualification function in ordinary language and are so used in many indexing languages.

CASIN avoids both punctuation marks and prepositions, especially in the first part of its subheading, by favoring constructions in which terms follow each other immediately. The most usual such construction is a noun preceded by a sequence of adjectival modifiers, sometimes rather ambiguously expressed; for example, "aflatoxins containing foods" meaning "foods which contain aflatoxins" (Schneider 1976, p. 44).

Appropriately chosen terms can reduce the need for a number of different kinds of connective to avoid ambiguity as to the types of links between terms. One method is to choose additional terms to be interposed. For example, the topic "journals in university libraries" is distinguished from a topic such as "journals about university libraries" in PRECIS by the addition of the term "stock":

$z11030$auniversities
$zp1030$alibraries
$z10220$auniversity libraries
$zp0030$astock
$zp1030$ajournals
Alternatively, a term may be changed to suggest a particular type of link to other terms; for instance, while "ABSTRACTS. INFORMATION" might mean either "INFORMATION in ABSTRACTS" or "INFORMATION about ABSTRACTS", "ABSTRACTS. INFORMATION CONTENT" clearly means the former.

A connective may sometimes be used to imply the type of link existing between two terms that it does not actually connect. For example, in the index string

Economic conditions. Great Britain
    Influence of trade unions
the "of", which connects "Influence" to "trade unions", also serves to imply that an "of" link does not exist from "Influence" to "Economic conditions". If an "of" link does not exist, the next most likely link normally understood by a searcher here will be an "on" link. By contrast, changing the preposition to a period may suggest that British economic conditions are influencing rather than being influenced:
Economic conditions. Great Britain
    Influence. Trade unions
{ 117}

5.4.2 Positions of connectives

Normally, connectives stand between the terms that they connect, but this is not a universal rule. The most common exception for English-language string indexing is the use of "backward-pointing" prepositions, as in
Animal tissues
     Phenoxy herbicide residues in
This technique, greatly favoured by compilers of book indexes, is fundamental to ASI and CASIN, optional in NEPHIS, occasional in the Relational Indexing system, and absolutely forbidden in PRECIS. Using it seems partly a matter of taste: its advantages are that it makes index strings clearer, by specifying linktypes, while producing collocation by terms rather than by possibly capricious ordinary-language connectives; its detractors point to its unnatural inverted structure, which may be a hindrance to searchers.

Even when a connective comes between the terms that it connects, other terms may intervene. The most obvious case, noted above, occurs where connectives marking qualifying links connect over shorter distances than those marking other links. For example, in the index string

Records of Circulation in Libraries in Ontario. Confidentiality
each of the prepositions connects immediately adjacent terms, while the period connects "Records" to "Confidentiality".

5.4.3 Structural ambiguity in index strings

As index strings become more complex, it may become unclear which terms are linked by the connectives. For example, is a searcher to interpret
EXTENSION of TIME for CONSTRUCTION of FENCE by JUDGE
as representing the structure
*EXTENSION
|
of
|
*TIME
|
for
|
*CONSTRUCTION---by---*JUDGE
|
of
|
*FENCE
{ 118} in which a judge is to build a fence, or the structure
*EXTENSION---by---*JUDGE
|
of
|
*TIME
|
for
|
*CONSTRUCTION
|
of
|
*FENCE
in which the judge extends the time required for someone else to build a fence?
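
That the two structures really do collapse into the same index string can be checked with a short sketch. It is purely illustrative and tied to no particular system: each structure is written as a nested (term, links) pair, and the hypothetical function flatten reads it out depth-first with the connective words between terms.

```python
def flatten(node):
    """Read a (term, [(connective, node), ...]) structure out as a linear string."""
    term, links = node
    return " ".join([term] + [f"{conn} {flatten(child)}" for conn, child in links])


fence = ("FENCE", [])
judge = ("JUDGE", [])

# Structure 1: "by JUDGE" is linked to CONSTRUCTION (the judge builds the fence).
judge_builds = ("EXTENSION", [
    ("of", ("TIME", [
        ("for", ("CONSTRUCTION", [("of", fence), ("by", judge)]))]))])

# Structure 2: "by JUDGE" is linked to EXTENSION (the judge extends the time).
judge_extends = ("EXTENSION", [
    ("of", ("TIME", [
        ("for", ("CONSTRUCTION", [("of", fence)]))])),
    ("by", judge)])

print(flatten(judge_builds))
print(flatten(judge_extends))
print(flatten(judge_builds) == flatten(judge_extends))
```

Once the structure has been flattened with uniform connectives, nothing in the string records which term "by JUDGE" was attached to.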

The use of different types of connective depending on the structure is one solution to the structural ambiguity problem. For example, PRECIS-style index strings would distinguish the two cases above:

Extension. Time. Construction of fence by judge.

Extension. Time. Construction. Fence
    By judge

Structural ambiguity may also be avoided by the use of adjectives. In English, this generally also involves a change in citation order; e.g.,

EXTENSION of TIME for JUDGE'S CONSTRUCTION of FENCE

EXTENSION of FENCE-CONSTRUCTION TIME by JUDGE
PRECIS guides indexers to express certain types of link through codes in column 3 and other types through the three-character codes for adjectives; the result tends to be a mixture of different grammatical constructions in index strings which aids comprehension in a way similar to that of good style in ordinary language. At the same time, the grammatical construction chosen is fairly predictable and consistent. Difficulties arise when no adjective can adequately express the sense required for qualification. For example, PRECIS' main role codes do not cover "of" in the sense of "in imitation of"; but there is no English adjective conceivable, still less acceptable to searchers, meaning "in imitation of items associated with the Indians of North America". Faced with this problem, a PRECIS indexer at the British Library made the phrase "toys of items associated with Indians of North America" into a single term.

A link represented in the input string is sometimes not represented at all in some of the resulting index strings.  In consequence, some detail on the relationships between the things referred to is lost.  At the same time, the { 119} index strings are less bulky and usually more quickly comprehended. For example, the index string

COURSES in UNIVERSITIES. ATTITUDES of STUDENTS
gives no explicit indication of the relationship between the students and the universities. Making the relationship more explicit by means of an additional connective, on the other hand, requires longer or less comprehensible index strings; for example,
Courses
    in Universities. Attitudes of students in -,
which could be produced by the Relational Indexing index string generator, or
Courses
    in Universities. Attitudes of Students in Universities
which is possible in LIPHIS.

Chapter 5 Summary

The syntax of an index string includes both the citation order of the terms and the connectives used to represent links between them. Although most existing theory of citation order is oriented toward systems in which there is only one entry per item, much of it is nevertheless applicable to string indexing. The most important principles, or rules, of citation order for string indexing are probably those of context and dependency; wherever possible, according to these principles, a term should be adjoined by those terms which qualify it, and a qualified term should follow a qualifying term. Other principles of citation order include: first-coming-to-mind, natural language, decreasing concreteness, more informative terms first, alphabetical, semantic neutrality, and obligatoriness.

Production of the citation order of the index string by simple manipulation of that of the input string, such as by KWOC procedures, cycling, or shunting, is more efficient than more sophisticated methods. On the other hand, each simple manipulation procedure produces index strings that lack certain desirable qualities. For example, KWOC-like procedures generate index strings with poor eliminability because the lead term is often not linked to the terms which immediately follow it. More sophisticated index string generators attempt to overcome the limitations of the simple manipulation approaches in various ways.

A number of string indexing systems favor a compromise between the context and dependency principles, with qualifying terms being inserted as soon as possible after the terms that they qualify if they cannot precede them. { 120}

Various agents may control citation order in a string indexing system. Control by rules promotes consistency but may be seen as excessively rigid. Limited control by the discipline of the indexed items takes some account of needs of different searchers for different kinds of access. Control by the indexer may be inconsistent but may improve some good qualities of index strings, such as clarity. Some control by the database can improve collocation and eliminability. In a few cases, the index producer, or even the searcher, is able to influence citation order.

The main function of connectives in an index string, apart from separating the terms, is to increase clarity or detail. The main types of connectives are words and punctuation. Words tend to be used when the terms connected are in close proximity and when the second term qualifies the first. A major disadvantage of words lies in the idiomatic constraints of natural language; punctuation, on the other hand, may lack detail or clarity. The need for a number of different kinds of connectives can often be avoided by appropriate choice of terms or connectives.

The position of a connective is normally between the terms that it connects. Even so, other terms may intervene. The resulting ambiguities as to which terms are connected can be avoided by using different connectives depending on the structure or by selective use of adjectives.

Loss of detail on links between terms is often tolerated to reduce index bulk and to improve clarity.
