LIS 558 - Information retrieval operations
Operations on documents
- Assignment of unique identifiers.
- Addition to (or deletion from) the database.
- Selection of fields for indexing or display.
- Parsing of fields (or segments).
- Posting of document identifiers to inverted file(s) (or deletion of postings).
- Display of documents.
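The operations listed above can be sketched as a toy inverted file. This is a minimal illustration only: the class name `InvertedFile`, its methods, and the parsing rule (lower-cased runs of letters and digits) are assumptions made for the example, not features of any particular system.

```python
import re
from collections import defaultdict


class InvertedFile:
    """Toy inverted file: maps each term to the set of document identifiers containing it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document identifiers
        self.documents = {}               # document identifier -> indexed text
        self.next_id = 1

    @staticmethod
    def parse(text):
        # Hypothetical parsing rule: lower-case the field and split it into word tokens.
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, text):
        # Assign a unique identifier, store the document, and post its terms.
        doc_id = self.next_id
        self.next_id += 1
        self.documents[doc_id] = text
        for term in self.parse(text):
            self.postings[term].add(doc_id)
        return doc_id

    def delete(self, doc_id):
        # Remove the document and delete its postings.
        text = self.documents.pop(doc_id)
        for term in self.parse(text):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]
```

For example, after adding two documents that both contain "retrieval", `postings["retrieval"]` holds both identifiers; deleting one document removes only its own postings.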
Operations on terms
Stemming
Stemming is intended to conflate related words,
usually by reducing them to a common root
by removing recognized suffixes.
It may be done both at the indexing stage and at the query stage.
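Suffix removal can be illustrated with a deliberately crude stemmer. The suffix list and the minimum root length below are hypothetical choices for the example; production stemmers use considerably more elaborate rules.

```python
# Hypothetical suffix list, ordered longest first so that "ings" is tried before "s".
SUFFIXES = ("ations", "ation", "ings", "ing", "ers", "er", "ed", "es", "s")


def stem(word, min_root=3):
    """Strip the first recognized suffix, provided at least min_root letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            return word[: -len(suffix)]
    return word
```

Applying the same function to indexed words and to query words means that, for instance, a query on "indexing" can match a document containing "indexed", since both reduce to "index".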
Weighting
A term may be weighted automatically in relation to a database
on the basis of how concentrated it is
in a particular subset of documents.
It may also be weighted in relation to a document
on the basis of how frequently it occurs
in the document.
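One common way to combine the two kinds of weight is sketched below. It assumes that concentration in the database is approximated by the logarithm of the inverse document frequency, which is only one of several possible measures; the function name `term_weights` and the input layout are also assumptions for the example.

```python
import math
from collections import Counter


def term_weights(documents):
    """documents: dict mapping document identifier -> list of terms.

    Returns a dict mapping document identifier -> {term: weight}, where the weight is
    (occurrences in the document) * log(number of documents / documents containing the term).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))  # count each term once per document

    weights = {}
    for doc_id, terms in documents.items():
        counts = Counter(terms)
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights
```

Under this scheme, a term that occurs in every document receives a weight of zero, while a term repeated within one document and absent elsewhere receives a high weight.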
Stop lists
Many of the most frequently occurring words
make ineffective search terms;
e.g., like, the, and, to,
of, an, out, a.
Such words can be put on a stop list
and filtered out during processing of the index and/or queries.
Generally, stop lists should be employed conservatively.
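Stop-list filtering amounts to a set lookup, as in the following sketch. The stop list shown simply reuses the example words given above; the function name is an assumption.

```python
# Stop list built from the example words in these notes; kept short on purpose.
STOP_WORDS = frozenset({"like", "the", "and", "to", "of", "an", "out", "a"})


def remove_stop_words(terms, stop_words=STOP_WORDS):
    """Return the terms that are not on the stop list (comparison ignores case)."""
    return [term for term in terms if term.lower() not in stop_words]
```

The same filter would be applied to document text before posting and to query text before searching, so that both sides agree on which words are ignored.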
Synonym dictionaries and thesauri
These may be applied to standardize terms automatically
during both indexing and searching
or to enhance indexing and queries.
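Both uses of a synonym dictionary can be sketched briefly. The dictionary contents and the function names below are hypothetical; `standardize` replaces each variant with a preferred term, while `expand` enhances a single term with all of its recorded variants.

```python
# Hypothetical synonym dictionary: variant term -> preferred term.
PREFERRED = {"automobile": "car", "auto": "car", "film": "movie"}


def standardize(terms):
    """Replace each term by its preferred form, leaving unknown terms unchanged."""
    return [PREFERRED.get(term, term) for term in terms]


def expand(term):
    """Return the preferred form of a term together with every variant that maps to it."""
    preferred = PREFERRED.get(term, term)
    variants = {variant for variant, target in PREFERRED.items() if target == preferred}
    return {preferred} | variants
```

Standardizing at both indexing and search time keeps the index small; expanding the query instead leaves the index untouched but requires ORing the sets for all variants.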
Operations on queries
Boolean operations
AND, OR, and NOT operators are offered by most commercial systems.
Efficiency can be greatly increased by using an inverted file
to identify documents containing a particular term.
Results are determined
through the merging of sets of document identifiers.
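The merging step can be sketched as follows, assuming each posting list is kept as a sorted list of document identifiers. The function names are assumptions for the example; `merge_and` walks the two lists in a single pass, and NOT is treated as "AND NOT", the form in which it is usually applied.

```python
def merge_and(a, b):
    """Intersect two sorted lists of document identifiers in one pass."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


def merge_or(a, b):
    """Union of two lists of document identifiers, returned in sorted order."""
    return sorted(set(a) | set(b))


def merge_not(a, b):
    """Identifiers in a that do not appear in b (a AND NOT b)."""
    excluded = set(b)
    return [doc_id for doc_id in a if doc_id not in excluded]
```

A query such as (cat AND dog) NOT fish then becomes `merge_not(merge_and(cat, dog), fish)`, where each argument is the posting list retrieved from the inverted file.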
Proximity and field searching
Proximity and field searching can be accomplished by merging sets
augmented with the location and/or field
of each term occurrence within a document.
If this additional information
has to be stored in the inverted file,
storage requirements are considerably increased.
An alternative is to filter the results of a Boolean AND
by processing the individual document records.
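The first approach, storing locations in the inverted file, can be sketched as below. The index layout (term -> document identifier -> list of word positions) and the function names are assumptions for the example; field information could be stored alongside the positions in the same way.

```python
from collections import defaultdict


def build_positional_index(documents):
    """documents: dict mapping document identifier -> list of terms.

    Returns term -> {document identifier: [word positions]}.
    """
    index = defaultdict(dict)
    for doc_id, terms in documents.items():
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return index


def within(index, term_a, term_b, distance):
    """Documents in which term_a and term_b occur within `distance` positions of each other."""
    postings_a = index.get(term_a, {})
    postings_b = index.get(term_b, {})
    hits = []
    # Only documents containing both terms (the Boolean AND) need their positions compared.
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(
            abs(pos_a - pos_b) <= distance
            for pos_a in postings_a[doc_id]
            for pos_b in postings_b[doc_id]
        ):
            hits.append(doc_id)
    return hits
```

The extra position lists are what increase the storage requirements; the alternative described above would keep plain sets in the index and check positions by scanning each document returned by the AND.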
Range searching
Range searching may be accomplished by ORing the sets for the individual matching terms.
Truncation
Right truncation can be dealt with as a kind of range searching.
If left truncation is desired,
a separate index of backwards terms will usually be needed,
though the two indexes can use the same sets of document identifiers.
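The following sketch treats truncation as a range search over a sorted term list. Here right truncation is taken to mean an open-ended word ending (comput*), and left truncation an open-ended word beginning (*compute). The function names and the use of the highest Unicode code point as an upper bound are assumptions made for the example.

```python
from bisect import bisect_left


def range_terms(sorted_terms, low, high):
    """Terms t in a sorted list with low <= t < high."""
    return sorted_terms[bisect_left(sorted_terms, low):bisect_left(sorted_terms, high)]


def right_truncation(sorted_terms, prefix):
    # Every term beginning with the prefix sorts between the prefix itself and the
    # prefix followed by the highest code point (assumed not to occur in real terms).
    return range_terms(sorted_terms, prefix, prefix + "\U0010ffff")


def left_truncation(sorted_reversed_terms, suffix):
    # Search the index of backwards terms with the reversed suffix,
    # then turn the matches the right way round again.
    matches = right_truncation(sorted_reversed_terms, suffix[::-1])
    return [term[::-1] for term in matches]


def or_postings(postings, terms):
    """OR together the sets of document identifiers for the matching terms."""
    result = set()
    for term in terms:
        result |= postings[term]
    return result
```

Because `left_truncation` returns terms in their normal spelling, the result can be looked up in the same `postings` mapping, so only the term list is duplicated, not the sets of document identifiers.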
Ranking documents by probable relevance
This is available in many newer systems,
especially on the Web,
and is typically based on measures of the frequency of a word
within a document compared to its frequency in the database.
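A simple ranking of this kind is sketched below. The scoring formula (relative frequency in the document multiplied by the log of the inverse document frequency, summed over query terms) is one common choice assumed for illustration; the function name `rank` and the tie-breaking by document identifier are likewise assumptions.

```python
import math
from collections import Counter


def rank(documents, query_terms):
    """documents: dict mapping document identifier -> list of terms.

    Returns identifiers of documents containing at least one query term,
    ordered from highest to lowest score.
    """
    n_docs = len(documents)
    counts = {doc_id: Counter(terms) for doc_id, terms in documents.items()}
    scores = Counter()
    for term in set(query_terms):
        containing = [doc_id for doc_id, c in counts.items() if term in c]
        if not containing:
            continue  # term does not occur in the database
        idf = math.log(n_docs / len(containing))
        for doc_id in containing:
            # Frequency within the document, relative to document length.
            relative_freq = counts[doc_id][term] / len(documents[doc_id])
            scores[doc_id] += relative_freq * idf
    # Highest score first; ties broken by document identifier.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

A document in which a query word makes up a large share of the text, while that word is rare elsewhere in the database, is placed near the top of the list.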
Last updated July 5, 2001.
This page maintained by
Prof. Tim Craven
E-mail (text/plain only): email@example.com
Faculty of Information and
University of Western
Canada, N6A 5B7