Query-independent Customized Index Entry Formats in a Concept-network Management System

(Unpublished paper, 1985)

Timothy C. Craven
School of Library and Information Science
The University of Western Ontario
London, Ontario
N6G 1H1
Canada

BACKGROUND

Earlier documentation (Craven 1982d, 1983b, 1984; Declerck and Craven 1983) has described a microcomputer-based system by which a human indexer can input and edit in graphic form networks of concepts and concept links from which a variety of permuted index displays can be produced on demand.

Using the graphic editing capability of the system, an indexer defines concept nodes by typing in appropriate terms, which are then displayed in a "staircase" across the screen. The indexer can also define links between concept nodes; these links are marked by one-character mnemonics, accompanied by slashes and, where the linked nodes are distant, by vertical and horizontal lines. For example, in indexing an item on "ignition of methane in the air in coal mines by sparks", the indexer might build up the display:


    IGNITION
    |/O
    |  METHANE
    |   /I
    |     AIR
     /B   |
       -- |--SPARKS
           /I /I
             ---COAL MINES

Here, the "/O" link indicates object; the "/I" links, environment; and the "/B" link, agent.

A different set of link types is defined for each database, and these definitions may be changed at will. Part of the definition of a link type is its one-character mnemonic. Other parts are a pair of "connectives" and a pair of "weights". Connectives are words, phrases, or punctuation marks that the software uses to express a link of that type in the permuted index entries. Weights are values indicating the strength of the link relative to other links and they are used by the software in determining citation order, heading-subheading division, and specificity of the permuted index entries.
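A linktype definition of this kind can be pictured as a small record. The following is only a sketch: it assumes that the two connectives and the two weights correspond to the two directions in which a link may be read, and the second connective of each pair and all the numeric weights are invented placeholders rather than values taken from any actual database.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LinkType:
    """One link type, as defined for a particular database."""
    mnemonic: str                 # one-character code shown in the graphic display
    connectives: Tuple[str, str]  # assumed: wording for each direction of the link
    weights: Tuple[int, int]      # assumed: strength for each direction of the link

    def __post_init__(self) -> None:
        if len(self.mnemonic) != 1:
            raise ValueError("mnemonic must be exactly one character")


def define_link_types(*link_types: LinkType) -> dict:
    """Index link types by mnemonic, rejecting duplicate mnemonics."""
    table = {}
    for link_type in link_types:
        if link_type.mnemonic in table:
            raise ValueError(f"duplicate mnemonic {link_type.mnemonic!r}")
        table[link_type.mnemonic] = link_type
    return table


# Mnemonics and first connectives follow the methane example above;
# the second connectives and all weights are invented placeholders.
LINK_TYPES = define_link_types(
    LinkType("O", ("of", "undergoing"), (90, 40)),
    LinkType("B", ("by", "causing"), (70, 40)),
    LinkType("I", ("in", "containing"), (50, 30)),
)
```

Because the definitions live in a table separate from the concept networks, redefining a connective or a weight changes every subsequent display without any description being re-edited.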

In the type of index display originally developed for this system, the user's control over linktype definitions and over threshold values associated with the weights provides many possibilities for variation in the format of index entries. For example, different permuted index entries beginning with "IGNITION" that could all correspond to the graphic display given above include:

  1. IGNITION of METHANE in AIR in COAL MINES by SPARKS
  2. IGNITION by SPARKS in COAL MINES of METHANE in AIR
  3. IGNITION of METHANE
  4. IGNITION of METHANE
  5. IGNITION

On the other hand, all such variations of display format assume a particular kind of query with a particular kind of relationship to the corresponding index entries. Specifically, the searcher is expected to type in a single term or a single string of initial characters, and every permuted index entry displayed in response will begin with that term or that string of characters.
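The query model just described amounts to a prefix lookup in an alphabetically ordered list of permuted entries. A minimal sketch follows, assuming that the entries are held as plain strings in sorted order; the function name and the sample list (borrowed from examples in this paper) are illustrative only and do not describe how the actual software stores its index.

```python
from bisect import bisect_left


def entries_beginning_with(sorted_entries, prefix):
    """Return every entry in an alphabetically sorted list that starts with prefix."""
    start = bisect_left(sorted_entries, prefix)
    matches = []
    for entry in sorted_entries[start:]:
        if not entry.startswith(prefix):
            break  # sorted order: no later entry can match
        matches.append(entry)
    return matches


# Sample entries only; a real index would hold every permuted entry.
INDEX = sorted([
    "IGNITION of METHANE in AIR in COAL MINES by SPARKS",
    "ACQUISITION of ART OBJECTS . ETHICS",
    "ACQUISITION of BRAZILIAN CITIZENSHIP . LAW",
    "ACQUISITION of BRITISH CITIZENSHIP . STATISTICS",
])

ignition_entries = entries_beginning_with(INDEX, "IGN")
citizenship_entries = entries_beginning_with(INDEX, "ACQUISITION of BR")
```

The comparison is case-sensitive, so a query would need to be normalized to the capitalization of the entries before the lookup.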

These assumptions about the query and its relationship to the index display seem fairly appropriate as long as the number of index entries, for all descriptions in the database, that begin with a given term remains relatively small. User effort is saved by the use of simple queries, and the displays presented in response can be scanned quickly.

For prolific terms, however, the number of index entries may become unmanageably large. Suppose, for example, that the database has a coverage similar to that of the British Library BLAISE database (British Library Automated Information Service 1979). An indexer who enters the search term "ACQUISITION" may then be faced with many screenfuls of index entries, starting with something like:


    ACQUISITION of AGRICULTURAL LAND in GREAT BRITAIN . SURVEYS &
              REPORTS
    ACQUISITION of ART OBJECTS . ETHICS
    ACQUISITION of BOOKS by LIBRARIES . SELECTION
    ACQUISITION of BOOKS by LIBRARIES of UNIVERSITIES in UNITED
              STATES . SELECTION . APPROVAL PLANS . SURVEYS &
              REPORTS
    ACQUISITION of BOOKS by PUBLIC LIBRARIES in GREAT BRITAIN .
              SELECTION
    ACQUISITION of BOOKS by PUBLIC LIBRARIES in GREAT BRITAIN
              from BOOKSELLERS . DELAY . SURVEYS & REPORTS
    ACQUISITION of BRAZILIAN CITIZENSHIP . LAW
    ACQUISITION of BRITISH CITIZENSHIP . STATISTICS
    ACQUISITION of CHILDREN'S BOOKS by LIBRARIES in UNITED STATES .
              SELECTION . READINGS
    ACQUISITION of CHILDREN'S STORIES by CHILDREN'S LIBRARIES in
              CANADA . EFFECTS relating to PUBLISHING of
              CHILDREN'S STORIES
    ACQUISITION of COGNITIVE SKILLS . SIMULATION . USE of HACKER
              PROGRAM
    ACQUISITION of COMPANIES in EUROPEAN COMMUNITY . PRACTICAL
              INFORMATION
    ACQUISITION of CONCEPTS of DEVELOPMENT of ANIMALS by CHILDREN
    ACQUISITION of CONCEPTS of PROBABILITIES & CHANCE by CHILDREN
    ACQUISITION of DOCUMENTS relating to SOUTHEAST ASIA by
              LIBRARY of AUSTRALIAN NATIONAL UNIVERSITY .
              PROPOSALS

To avoid excessively long index displays, as well as for other reasons, searchers may prefer to be able to enter other kinds of search specifications. Search specifications involving Boolean logic are one obvious example, but the use of other kinds of complex specifications may also be desirable: simple lists of search terms; lists of weighted search terms; substructures to be matched against parts of descriptions, as in TOSAR (Fugmann and others 1974) or Relational Indexing (Farradane 1980a, 1980b; Farradane and Thompson 1980); citations to known relevant documents; and so on.

THE QUERY-INDEPENDENT DISPLAY FORMAT

The purpose of the present article is to consider a somewhat different type of index display from that originally designed. This new type of index display is intended to make the form of the index entries independent of the search specification or the searching method used to retrieve the descriptions; in other words, it should be usable for index displays of all sorts of document sets, regardless of how these sets were derived. At the same time, however, the type of index display to be considered is designed to retain customizability based on the ability of the user to change linktype definitions and the "subheading" and "cutoff" threshold values.

Apart from query-independence and customizability, the new format has one other major characteristic and two major preferences. The major characteristic is that it includes all the terms in the description. Its major preferences are for: 1. a term order in which qualifying terms follow the terms that they qualify; 2. the use of connectives to distinguish the types of links between concepts.

How the format behaves will be shown by way of an example. Suppose that the Boolean query "ACQUISITION AND CHILD*" retrieves descriptions for seven documents:

  1. "acquisition of language skills by children and babies"
  2. "acquisition of the concepts of probabilities and chance by children"
  3. "effects of bilingualism on the acquisition of language skills by children"
  4. "acquisition of language skills by children"
  5. "children's acquisition of concepts of the development of animals"
  6. "effects of the acquisition of children's stories by children's libraries in Canada on the publishing of children's stories in Canada"
  7. "readings on the selection of children's books for acquisition by libraries in the United States"

If the documents have been appropriately indexed, the new format typically gives the following display:


    ACQUISITION of CONCEPTS of DEVELOPMENT of ANIMALS by
              CHILDREN
    ACQUISITION of CONCEPTS of PROBABILITIES & CHANCE by CHILDREN
    ACQUISITION of LANGUAGE SKILLS by CHILDREN
        & BABIES
    EFFECTS of ACQUISITION of CHILDREN'S STORIES by CHILDREN'S
              LIBRARIES in CANADA relating to PUBLISHING of
              CHILDREN'S STORIES
    EFFECTS of BILINGUALISM of CHILDREN relating to ACQUISITION
              of LANGUAGE SKILLS
    READINGS relating to SELECTION of CHILDREN'S BOOKS by
              LIBRARIES in UNITED STATES relating to ACQUISITION

Before proceeding further, a couple of points should be noted about this display. First, the entries "ACQUISITION of LANGUAGE SKILLS by CHILDREN" and "ACQUISITION of LANGUAGE SKILLS by CHILDREN & BABIES" are grouped together under a common heading "ACQUISITION of LANGUAGE SKILLS by CHILDREN"; this grouping can be eliminated if desired by lowering the "subheading" threshold value. Second, the number of linktypes has been limited by the database designer, with the "relating to" linktype being used as a sort of catchall; hence, the rather stilted expressions "relating to PUBLISHING" and "relating to ACQUISITION".

Each of the entries is independent of the query. For example, the first entry would remain "ACQUISITION of CONCEPTS of DEVELOPMENT of ANIMALS by CHILDREN" regardless of whether the query were "ACQUISITION AND CHILD*", "CHILDREN", "ANIMALS AND DEVELOPMENT", "CONCEPT* OR IDEA*", or any other Boolean or non-Boolean formulation satisfied by the document in question.

All the terms are also included in each entry. The result is that each entry is a more or less complete description of the document.

The format's preferences for postposing qualifiers and for distinguishing linktypes through connectives are fully expressed here. In longer descriptions, notably in "EFFECTS of ACQUISITION of CHILDREN'S STORIES by CHILDREN'S LIBRARIES in CANADA relating to PUBLISHING of CHILDREN'S STORIES", the result may be somewhat difficult to follow; but, in shorter descriptions, the meaning is generally clear and quickly assimilated.

CUSTOMIZABILITY OF THE FORMAT

Customizability can be shown by illustrating some results of the user's changing certain values without changing the underlying document descriptions.

Suppose first that the user changes the definition of the "by" linktype so that it has a higher weight than the "of" linktype. The resulting display is:


    ACQUISITION by CHILDREN & BABIES of LANGUAGE SKILLS
    ACQUISITION by CHILDREN of CONCEPTS of DEVELOPMENT of ANIMALS
    ACQUISITION by CHILDREN of CONCEPTS of PROBABILITIES & CHANCE
    ACQUISITION by CHILDREN of LANGUAGE SKILLS
    EFFECTS of ACQUISITION by CHILDREN'S LIBRARIES in CANADA of
              CHILDREN'S STORIES relating to PUBLISHING of
              CHILDREN'S STORIES
    EFFECTS of BILINGUALISM of CHILDREN relating to ACQUISITION
              of LANGUAGE SKILLS
    READINGS relating to SELECTION by LIBRARIES in UNITED STATES
              of CHILDREN'S BOOKS relating to ACQUISITION

Here, more emphasis is being placed on the agent as a way of distinguishing one process from another and less on the patient; e.g., more on "children and babies" and less on "language skills" in the first entry. In a longer display, such a change might have important implications for the grouping of the entries; here, it only serves to separate slightly the two entries relating to "language skills".
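The effect of such a reweighting on term order can be sketched with a small recursive renderer. The sketch rests on several assumptions that are not statements about the NETPAD implementation: each description is modelled as a tree rooted at its lead term, each linktype is identified simply by its connective, qualifiers are emitted after the term they qualify in descending order of weight, and the numeric weights are invented.

```python
def render(node, weights):
    """Render a term followed by its qualifiers, strongest link type first.

    node is (term, [(connective, child_node), ...]).
    """
    term, links = node
    # sorted() is stable, so equally weighted links keep their input order.
    ordered = sorted(links, key=lambda link: -weights[link[0]])
    parts = [term]
    for connective, child in ordered:
        parts.append(connective)
        parts.append(render(child, weights))
    return " ".join(parts)


# "children's acquisition of concepts of the development of animals"
DESCRIPTION = ("ACQUISITION", [
    ("of", ("CONCEPTS", [
        ("of", ("DEVELOPMENT", [
            ("of", ("ANIMALS", [])),
        ])),
    ])),
    ("by", ("CHILDREN", [])),
])

# Hypothetical weights: first with "of" stronger, then with "by" stronger.
of_first = render(DESCRIPTION, {"of": 90, "by": 70})
by_first = render(DESCRIPTION, {"of": 70, "by": 90})
# of_first == "ACQUISITION of CONCEPTS of DEVELOPMENT of ANIMALS by CHILDREN"
# by_first == "ACQUISITION by CHILDREN of CONCEPTS of DEVELOPMENT of ANIMALS"
```

Only the weight table differs between the two calls; the description itself is untouched, which is the sense in which the format is customizable without re-indexing.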

Second, suppose that the user raises the subheading threshold, above the weight of the "of" linktype. The resulting display is:


    ACQUISITION by CHILDREN & BABIES of LANGUAGE SKILLS
    ACQUISITION by CHILDREN of CONCEPTS of DEVELOPMENT of ANIMALS
       of PROBABILITIES & CHANCE
      of LANGUAGE SKILLS
    EFFECTS of ACQUISITION by CHILDREN'S LIBRARIES in CANADA of
              CHILDREN'S STORIES relating to PUBLISHING of
              CHILDREN'S STORIES
      of BILINGUALISM of CHILDREN relating to ACQUISITION of
              LANGUAGE SKILLS
    READINGS relating to SELECTION by LIBRARIES in UNITED STATES
              of CHILDREN'S BOOKS relating to ACQUISITION

The display is more compact and, in that respect, easier to scan; on the other hand, the person scanning it may have difficulty in attaching, to the line "of PROBABILITIES & CHANCE", the appropriate heading-plus-subheading "ACQUISITION by CHILDREN of CONCEPTS".
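The grouping behaviour can be approximated by a simple model in which an entry may be divided only before a link whose weight falls below the subheading threshold, and any leading segments shared with the previous entry are suppressed. The sketch below is that model only; the weights, the two-space indent per suppressed segment, and the function names are assumptions, and no attempt is made to reproduce the exact indentation of the displays above.

```python
def split_segments(parts, threshold):
    """Divide an entry before each link whose weight is below the threshold.

    parts is [(None, lead_term), (weight, "connective TERM"), ...].
    """
    segments = [parts[0][1]]
    for weight, text in parts[1:]:
        if weight < threshold:
            segments.append(text)        # weak link: start a new subheading
        else:
            segments[-1] += " " + text   # strong link: stay in the same segment
    return segments


def display(entries, threshold, indent="  "):
    """Suppress leading segments that repeat those of the previous entry."""
    lines, previous = [], []
    for parts in entries:
        segments = split_segments(parts, threshold)
        shared = 0
        # Always leave at least one segment to print.
        while (shared < len(segments) - 1 and shared < len(previous)
               and segments[shared] == previous[shared]):
            shared += 1
        lines.append(indent * shared + " ".join(segments[shared:]))
        previous = segments
    return lines


BY, OF = 80, 60  # hypothetical weights, with "by" stronger than "of"
ENTRIES = [
    [(None, "ACQUISITION"), (BY, "by CHILDREN"), (OF, "of CONCEPTS"),
     (OF, "of DEVELOPMENT"), (OF, "of ANIMALS")],
    [(None, "ACQUISITION"), (BY, "by CHILDREN"), (OF, "of CONCEPTS"),
     (OF, "of PROBABILITIES & CHANCE")],
    [(None, "ACQUISITION"), (BY, "by CHILDREN"), (OF, "of LANGUAGE SKILLS")],
]

grouped = display(ENTRIES, threshold=70)    # "of" now falls below the threshold
ungrouped = display(ENTRIES, threshold=50)  # no link falls below the threshold
```

With the threshold at 70 the second and third entries shrink to their distinguishing tails; with the threshold at 50 every entry is printed in full.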

Third, suppose that the user raises the cutoff threshold somewhat, above the weights of the "in" linktype and of the catchall "relating to" linktype. The resulting display is:


    ACQUISITION by CHILDREN & BABIES of LANGUAGE SKILLS
    ACQUISITION by CHILDREN of CONCEPTS of DEVELOPMENT of ANIMALS
       of PROBABILITIES & CHANCE
      of LANGUAGE SKILLS
    EFFECTS of ACQUISITION by CHILDREN'S LIBRARIES of CHILDREN'S
              STORIES . PUBLISHING of CHILDREN'S STORIES . CANADA
      of BILINGUALISM of CHILDREN . ACQUISITION by CHILDREN of
              LANGUAGE SKILLS
    READINGS . SELECTION by LIBRARIES of CHILDREN'S BOOKS .
              ACQUISITION by LIBRARIES of CHILDREN'S BOOKS .
              UNITED STATES

Because the query-independent format is required to include all terms in every entry, raising the cutoff threshold does not shorten the entries, as it does in the original query-dependent format. Instead, the typical effect is to chop up an entry into several segments separated by periods. When these segments are added together, the overall entry may in fact be longer than it would be with the cutoff threshold lower. Note the repetition of "by LIBRARIES of CHILDREN'S BOOKS" in the final entry above.
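The chopping effect can be illustrated with a deliberately simplified model: a link whose weight is below the cutoff is not expressed by a connective, and the subtree hanging from it is instead moved to a later segment of its own. The sketch assumes a tree-shaped description, linktypes identified by their connectives, invented weights, and an ordering of the displaced segments by descending link weight; it does not attempt the repetition of context seen in the "READINGS" entry above.

```python
def render_with_cutoff(root, weights, cutoff):
    """Render a description as period-separated segments.

    root is (term, [(connective, child_node), ...]); links weaker than
    cutoff are left unexpressed and their subtrees become later segments.
    """
    def inline(node, deferred):
        term, links = node
        parts = [term]
        for connective, child in sorted(links, key=lambda link: -weights[link[0]]):
            if weights[connective] < cutoff:
                deferred.append((weights[connective], child))
            else:
                parts.append(connective)
                parts.append(inline(child, deferred))
        return " ".join(parts)

    segments = []
    pending = [root]  # nodes that still need a segment of their own
    while pending:
        node = pending.pop(0)
        deferred = []
        segments.append(inline(node, deferred))
        deferred.sort(key=lambda item: -item[0])  # strongest unexpressed link first
        pending.extend(child for _, child in deferred)
    return " . ".join(segments)


# Hypothetical weights, with "by" already raised above "of".
WEIGHTS = {"by": 90, "of": 70, "relating to": 30, "in": 20}

DESCRIPTION = ("EFFECTS", [
    ("of", ("ACQUISITION", [
        ("by", ("CHILDREN'S LIBRARIES", [("in", ("CANADA", []))])),
        ("of", ("CHILDREN'S STORIES", [])),
    ])),
    ("relating to", ("PUBLISHING", [("of", ("CHILDREN'S STORIES", []))])),
])

all_links = render_with_cutoff(DESCRIPTION, WEIGHTS, cutoff=0)
chopped = render_with_cutoff(DESCRIPTION, WEIGHTS, cutoff=40)
```

Under these assumed weights, a cutoff of 40 suppresses "in" and "relating to" while leaving every term in the entry. Setting the cutoff above every weight reduces the entry to a bare list of terms, although the order in which this sketch lists them is not claimed to match that of the actual software.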

For longer entries, the chopping up may improve readability, especially when a relatively troublesome linktype is brought below the threshold and need no longer be expressed. Thus, the entry "EFFECTS of ACQUISITION by CHILDREN'S LIBRARIES of CHILDREN'S STORIES . PUBLISHING of CHILDREN'S STORIES . CANADA" is probably clearer and more readable than "EFFECTS of ACQUISITION by CHILDREN'S LIBRARIES in CANADA of CHILDREN'S STORIES relating to PUBLISHING of CHILDREN'S STORIES" even though the latter distinguishes more different links between concepts.

The requirement of retaining all terms even when the cutoff threshold is raised derives basically from the difficulty of deciding, without reference to any query, which terms should be retained and which might safely be dropped. A simplistic rule such as retaining only the initial terms could yield generally unhelpful entries such as an unqualified "READINGS".

As the cutoff threshold is raised further, the ultimate result is for no links to be expressed explicitly in any of the entries. Each entry thus becomes simply a list of terms; e.g.,


    ACQUISITION . CONCEPTS . CHILDREN . CHANCE . PROBABILITIES
      . DEVELOPMENT . ANIMALS . CHILDREN
     . LANGUAGE SKILLS . CHILDREN
       . BABIES
    EFFECTS . ACQUISITION . CHILDREN'S STORIES . CHILDREN'S
              LIBRARIES . PUBLISHING . CANADA . CHILDREN'S
              STORIES
     . BILINGUALISM . ACQUISITION . LANGUAGE SKILLS . CHILDREN
    READINGS . SELECTION . ACQUISITION . CHILDREN'S BOOKS .
              LIBRARIES . UNITED STATES

In general, such entries are likely to be relatively less useful to searchers than entries in which at least some types of concept links are indicated.

CONCLUSION

This article has illustrated, for a concept-network-based indexing system, one possible index entry format that is both customizable and query independent. The customizing features of the format allow the user some control over term order and over how and whether concept links of various types are expressed. Redefinition of linktypes and resetting of threshold values yield entries that are relatively strong or weak in various aspects.

The requirement for query independence restricts the format quite severely. Notably, it leads to the inclusion of every term in all variations of entry.

The strong preference for the postposing of qualifying terms may not be the best choice. A mixed system in which certain essentially qualifying terms precede the terms that they qualify, as in many existing systems of subject headings, including the "feature" headings of PRECIS (Austin and Dykstra 1984), might be more appropriate.

Another area for future investigation is the development of more sophisticated query-dependent entry formats. In this area, formats specially suited to the display of responses to Boolean queries seem especially important.

TECHNICAL NOTE

The query-independent entry format illustrated in this paper has been implemented as part of the NETPAD experimental concept-network management system. The current version of the NETPAD software is written in BASICA for use on MS-DOS/PC-DOS microcomputers. Anyone interested in obtaining a copy of the NETPAD software should send a dual-density 5 1/4-inch floppy diskette to the author.

ACKNOWLEDGEMENT

The research reported in this paper was supported in part by Individual Operating Grant A3027 of the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

Notes, 2010
Program files for NETPAD are no longer available.

Last updated June 15, 2010, by Tim Craven