LIS 558 - Information retrieval operations

Operations on documents

Operations on terms

Stemming

Stemming is intended to conflate related words, usually by removing recognized suffixes to reduce them to a common root; e.g., indexing, indexes, and indexed all reduce to index. It may be done at both the indexing stage and the query stage.
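
As a rough illustration of suffix removal, here is a minimal Python sketch. The suffix list, the function name stem, and the min_root parameter are assumptions made for this example only; production stemmers (such as the Porter stemmer) use far more elaborate rules.

```python
# Hypothetical suffix list for illustration; not a complete English stemmer.
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")

def stem(word, min_root=3):
    """Strip the longest recognized suffix, keeping at least min_root letters."""
    word = word.lower()
    # Try longer suffixes first so "ings" is preferred over "s".
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            return word[: -len(suffix)]
    return word
```

Applying the same function to both index terms and query terms is what lets a query on "indexing" retrieve a document containing "indexed".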

Weighting

A term may be weighted automatically in relation to a database on the basis of how concentrated it is in a particular subset of documents. It may also be weighted in relation to a document on the basis of how frequently it occurs in the document.
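
One common way to combine the two kinds of weight is to multiply a within-document frequency by an inverse document frequency. The sketch below assumes that particular formula (tf multiplied by the natural log of N/df); the notes above do not prescribe one, and the function name term_weights is hypothetical.

```python
import math
from collections import Counter

def term_weights(docs):
    """Return {doc_id: {term: tf * idf}} for a dict of doc_id -> list of terms."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # frequency of each term within this document
        weights[doc_id] = {
            t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf
        }
    return weights
```

A term that occurs in every document gets weight zero under this formula, while a term concentrated in a few documents is weighted up.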

Stopping

Many of the most frequently occurring words make ineffective search terms; e.g., like, the, and, to, of, an, out, a. Such words can be put on a stoplist and filtered out during processing of the index and/or queries.

Generally, stoplists should be employed conservatively, since a word removed from the index can no longer be searched at all.
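
Filtering against a stoplist amounts to a set-membership test on each token. In this sketch the stoplist contents are taken from the examples above, kept deliberately short; the name remove_stopwords is an assumption for illustration.

```python
# A deliberately small stoplist, in keeping with conservative use.
STOPLIST = frozenset({"the", "and", "to", "of", "an", "a"})

def remove_stopwords(tokens, stoplist=STOPLIST):
    """Drop tokens that appear on the stoplist (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stoplist]
```

The same function can be applied to document text at indexing time, to queries, or to both.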

Synonym dictionaries and thesauri

These may be applied either to standardize terms automatically during both indexing and searching, or to enhance indexing and queries with additional related terms.
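
The two uses can be sketched with a single lookup table mapping variant terms to a preferred term. The table contents and the function names standardize and expand are hypothetical.

```python
# Hypothetical synonym table: variant -> preferred term.
PREFERRED = {"automobile": "car", "auto": "car", "motorcar": "car"}

def standardize(tokens, preferred=PREFERRED):
    """Replace each variant with its preferred term (used at index or query time)."""
    return [preferred.get(t, t) for t in tokens]

def expand(term, preferred=PREFERRED):
    """Return the term together with every variant sharing its preferred form."""
    head = preferred.get(term, term)
    variants = {v for v, p in preferred.items() if p == head}
    return {head} | variants
```

Standardizing keeps the index small but must be applied consistently on both sides; expanding leaves the index alone and ORs the extra terms into the query.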

Operations on queries

Boolean operations

AND, OR, and NOT operators are offered by most commercial systems. Efficiency can be greatly increased by using an inverted file to identify documents containing a particular term. Results are determined through the merging of sets of document identifiers.
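
A minimal sketch of an inverted file and the set merging it enables follows. The sample documents and the name build_inverted_file are invented for illustration; Python sets stand in for the sorted identifier lists a real system would merge.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the set of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    1: "information retrieval systems",
    2: "database systems",
    3: "information theory",
}
index = build_inverted_file(docs)

both = index["information"] & index["systems"]     # AND: intersection
either = index["information"] | index["systems"]   # OR: union
without = index["information"] - index["systems"]  # NOT: difference
```

No document text is scanned at query time; each operator works only on the stored sets of identifiers.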

Adjacency/proximity operations

These can be accomplished by merging sets augmented with the location and/or field of each term occurrence within a document. If this additional information has to be stored in the inverted file, storage requirements increase considerably.

An alternative is to filter the results of a Boolean AND by processing the individual document records.
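
The first approach, storing locations in the inverted file, might look like the following sketch. Word offsets are used as locations, and the names build_positional_index and within, along with the max_gap and ordered parameters, are assumptions for illustration.

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def within(index, first, second, max_gap=1, ordered=True):
    """Ids of documents where `second` occurs within max_gap words of `first`."""
    hits = set()
    postings_a = index.get(first, {})
    postings_b = index.get(second, {})
    # Only documents containing both terms (the Boolean AND) can match.
    for doc_id in postings_a.keys() & postings_b.keys():
        for a in postings_a[doc_id]:
            for b in postings_b[doc_id]:
                gap = b - a
                if 0 < gap <= max_gap or (not ordered and 0 < -gap <= max_gap):
                    hits.add(doc_id)
    return hits
```

With max_gap=1 and ordered=True this behaves as an adjacency operator; larger gaps or ordered=False give looser proximity. The per-occurrence position lists are the extra storage cost mentioned above.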

Range searching

This may be accomplished by ORing the sets of document identifiers for all the individual index terms that fall within the range.
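
A sketch of this, assuming the index terms are kept in sorted order so that the matching terms form one contiguous run. The function name range_search is hypothetical, and the comparison is plain string order, so numeric values would need a fixed width (as with four-digit years).

```python
from bisect import bisect_left, bisect_right

def range_search(index, low, high):
    """Union of the document sets for every index term t with low <= t <= high."""
    terms = sorted(index)  # a real system would keep the terms sorted already
    start = bisect_left(terms, low)
    end = bisect_right(terms, high)
    result = set()
    for term in terms[start:end]:
        result |= index[term]  # OR the set for each matching term
    return result
```

For example, with year terms "1998" through "2001" in the index, range_search(index, "1999", "2000") ORs together the sets for "1999" and "2000" only.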

Truncation

Right truncation (e.g., comput*) can be dealt with as a kind of range searching, since all terms sharing a beginning are adjacent in a sorted index. If left truncation (e.g., *ology) is desired, a separate index of backwards terms will usually be needed, though the two indexes can use the same sets of document identifiers.
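
The sketch below treats a search on the beginning of a term (such as comput*) as a range over the sorted terms, and a search on the end of a term (such as *ology) as the same operation over the terms spelled backwards. The * wildcard syntax and the names prefix_range and truncation_search are assumptions for illustration.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Contiguous run of sorted_terms whose entries start with prefix."""
    start = bisect_left(sorted_terms, prefix)
    end = start
    while end < len(sorted_terms) and sorted_terms[end].startswith(prefix):
        end += 1
    return sorted_terms[start:end]

def truncation_search(index, pattern):
    """Handle 'comput*' (end truncated) and '*ology' (beginning truncated)."""
    if pattern.endswith("*"):
        matches = prefix_range(sorted(index), pattern[:-1])
    elif pattern.startswith("*"):
        # Backwards-term index; a real system would build this once, not per query.
        backwards = sorted(t[::-1] for t in index)
        reversed_suffix = pattern[1:][::-1]
        matches = [t[::-1] for t in prefix_range(backwards, reversed_suffix)]
    else:
        matches = [pattern] if pattern in index else []
    result = set()
    for term in matches:
        result |= index[term]  # both term lists point at the same document sets
    return result
```

Only the list of terms is duplicated in reversed form; the sets of document identifiers are looked up in the one shared index.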

Ranking documents by probable relevance

This is available in many newer systems, especially on the Web, and is typically based on measures of the frequency of a word within a document compared to its frequency in the database.
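
As one hedged example of such a measure, the sketch below scores each document by summing, over the query terms, the term's relative frequency in the document multiplied by the log of its inverse document frequency. Actual systems differ in the exact formula; the function name rank and the sample scoring are assumptions.

```python
import math
from collections import Counter

def rank(docs, query_terms):
    """Return ids of matching documents, most probably relevant first."""
    n_docs = len(docs)
    counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    scores = {}
    for doc_id, tf in counts.items():
        length = sum(tf.values())
        score = 0.0
        for term in query_terms:
            # Number of documents in the database containing the term.
            df = sum(1 for c in counts.values() if term in c)
            if df and length:
                score += (tf[term] / length) * math.log(n_docs / df)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Unlike a Boolean search, the result is an ordered list rather than an unordered set, and documents in which a query term is relatively frequent rise to the top.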

Last updated July 5, 2001.
This page maintained by Prof. Tim Craven
E-mail (text/plain only): craven@uwo.ca
Faculty of Information and Media Studies
University of Western Ontario,
London, Ontario
Canada, N6A 5B7