TexNet32 - Extracting



The automatic generation of extracts from full texts proves to be very useful to abstractors. They tend to use automatic extracts as a starting point for creating their own abstracts.

When you first choose any of the extract options for a new or modified full text, TexNet32 automatically word indexes each paragraph in the text, counting the number of times each word occurs. This process seems to be somewhat slower in TexNet32, which is written in Delphi 6, than in TexNetF, which was written in Borland Pascal for Windows, because of differences in string handling in the two development environments.
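The per-paragraph word index described above can be pictured with a minimal Python sketch. It is an illustration only, not TexNet32's Delphi code: the function name index_paragraphs, the blank-line paragraph separator, and the definition of a word as a run of letters, digits, or apostrophes are all assumptions.

```python
import re
from collections import Counter

def index_paragraphs(text):
    """Return one Counter of lowercased word counts per paragraph."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    # Assumption: a word is a run of letters, digits, or apostrophes.
    return [Counter(re.findall(r"[a-z0-9']+", p.lower())) for p in paragraphs]

sample = "The cat sat.\nThe cat ran.\n\nA dog barked."
index = index_paragraphs(sample)
```

Each entry of index corresponds to one paragraph, so paragraph weights can later be computed from the counts without rescanning the text.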

Currently all extraction is done from the whole source text. No provision is made for hiding materials that should not be counted, such as illustrations or references, and their presence may make the extracts less useful. To avoid this, delete peripheral materials from the text in main memory before requesting extracts.

Paragraph extracts

TexNet32 can display several different forms of extracts consisting of complete paragraphs. Such extracts are derived by weighting and other techniques described below.

The "Boolean" extract displays paragraphs matching a Boolean query supplied by the user. In addition to the Boolean operators and, or, and not, you can also use the proximity operators with (for words together in the same order) and near (for words together in either order), either of which may have a number appended to specify a maximum number of intervening words (for example, with3 or near12). The question mark (?) can be used for right truncation. If no operators are specified, or is assumed.
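The proximity operators and right truncation might be evaluated roughly as in the following Python sketch. It is not the program's query parser: the helper names tokenize, term_matches, and proximity are hypothetical, and treating a bare with or near as allowing zero intervening words is an assumption.

```python
import re

def tokenize(text):
    # Assumption: a word is a run of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())

def term_matches(term, word):
    # A trailing "?" is treated as right truncation (prefix match).
    if term.endswith("?"):
        return word.startswith(term[:-1])
    return word == term

def proximity(words, left, right, max_gap=0, ordered=True):
    """True if left and right occur with at most max_gap words between them.

    ordered=True models "with" (same order); ordered=False models "near".
    """
    left_pos = [i for i, w in enumerate(words) if term_matches(left, w)]
    right_pos = [i for i, w in enumerate(words) if term_matches(right, w)]
    for i in left_pos:
        for j in right_pos:
            if i == j:
                continue  # the same word cannot serve as both terms
            gap = abs(j - i) - 1  # number of intervening words
            if gap <= max_gap and (j > i or not ordered):
                return True
    return False

words = tokenize("Automatic extraction of long texts helps abstractors")
with3 = proximity(words, "automatic", "texts", max_gap=3, ordered=True)
with2 = proximity(words, "automatic", "texts", max_gap=2, ordered=True)
near3 = proximity(words, "texts", "automatic", max_gap=3, ordered=False)
```

Here with3 and near3 are true, while with2 is false because three words separate "automatic" from "texts".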

The "Stem matches" extract shows all paragraphs that contain at least one stem from a text specified by the user. This text could be a word or a phrase from the full text on which the user would like more information.

The "Weighted" extract consists of those paragraphs that have been assigned the highest weight according to various methods that must be invoked by the user. Since the weights are initially all set to zero, invoking a "Weighted" extract without first invoking any weighting methods yields simply the whole of the full text (not just the initial part, as in TexNetF). For weighting methods, see the "Weighting" section.

Before using extracting options, you may set up the minimum length of extracts in characters and as a percent of the full text (see the "Parameters" section).

"Sized by weight" always extracts all paragraphs from the full text, but adjusts font sizes to suggest relative paragraph weights.

Although not a type of extract as such, the "Word wrap" option in the "Edit" menu is intended to replace the "Incipits" extract available in TexNetF.

Words

The "Frequent keywords" option extracts words from the full text, selected automatically on the basis of frequency, with stop-words omitted.

All non-stop-words are divided into "frequent" and "infrequent" on the basis of a threshold set by the user in the "Parameters" window. After adjusting the threshold, just select the "Frequent keywords" option again to see the new list.
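The division into "frequent" and "infrequent" words can be sketched in Python as follows. The stop-word list is a tiny illustrative stand-in, the function name frequent_keywords is hypothetical, and the sketch assumes that the threshold is a minimum occurrence count and that a word at the threshold counts as frequent.

```python
import re
from collections import Counter

# Illustrative stop-word list only; the program's own list is not shown here.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}

def frequent_keywords(text, threshold):
    """Split non-stop-words into (frequent, infrequent) by occurrence count."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # Assumption: a word at or above the threshold counts as frequent.
    frequent = sorted(w for w, n in counts.items() if n >= threshold)
    infrequent = sorted(w for w, n in counts.items() if n < threshold)
    return frequent, infrequent

text = "The index of the text lists the words in the text. The index is short."
frequent, infrequent = frequent_keywords(text, threshold=2)
```

Raising the threshold moves words from the first list to the second, which mirrors the effect of adjusting the parameter and selecting the option again.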

The "Unusual words" option extracts the 5-10% of the words that are relatively frequent in the text when compared with the currently open database index. "Frequent" words are highlighted in the extract.
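One way to picture "relatively frequent when compared with the database index" is a ratio of relative frequencies, as in the Python sketch below. The ratio, the add-one smoothing, the function name unusual_words, and the fraction parameter are all assumptions made for illustration; the comparison actually used by the program is not specified here.

```python
import math

def unusual_words(text_counts, index_counts, fraction=0.1):
    """Return the top fraction of words ranked by text rate / index rate."""
    text_total = sum(text_counts.values())
    index_total = sum(index_counts.values())

    def ratio(word):
        text_rate = text_counts[word] / text_total
        # Add-one smoothing so words absent from the index do not divide by zero.
        index_rate = (index_counts.get(word, 0) + 1) / (index_total + 1)
        return text_rate / index_rate

    # Highest ratio first; ties broken alphabetically for a stable result.
    ranked = sorted(text_counts, key=lambda w: (-ratio(w), w))
    keep = max(1, math.ceil(len(ranked) * fraction))
    return ranked[:keep]

text_counts = {"extract": 5, "the": 10, "paragraph": 3, "of": 6}
index_counts = {"the": 1000, "of": 600, "paragraph": 20, "extract": 2}
top = unusual_words(text_counts, index_counts, fraction=0.5)
```

With these made-up counts, common function words rank low even though they occur often in the text, because they are just as common in the index.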

The words can be copied into the abstract window in any order by clicking on each word and typing Ctrl-Space.

Phrases and other short passages

TexNet32 currently provides three different kinds of extracts of phrases or other short passages, represented by the "Phrases", "Passages", and "Repeats" options.

When the "Phrase" option is selected, the program enumerates all different word sequences in the text up to a certain maximum length, which currently is 10. The method employed - to store each word only once and to embody phrases in multi-level trees of pointers - allows the entire enumeration to take place in main memory even for fairly long documents. Because of the Delphi string handling model, processing is somewhat slower than in TexNetF.
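The idea of storing each word once and representing phrases as a multi-level tree can be sketched in Python with a vocabulary table and nested dictionaries. This is an analogy to the pointer trees described above, not the program's data structure; the names build_phrase_tree and phrase_count are hypothetical.

```python
def build_phrase_tree(words, max_len=10):
    """Build a tree in which each path from the root is a word sequence."""
    vocab = {}  # each distinct word is stored once and mapped to an integer id
    ids = [vocab.setdefault(w, len(vocab)) for w in words]
    root = {"count": 0, "children": {}}
    for start in range(len(ids)):
        node = root
        # Walk at most max_len words forward from this starting position.
        for word_id in ids[start:start + max_len]:
            node = node["children"].setdefault(
                word_id, {"count": 0, "children": {}})
            node["count"] += 1  # one more occurrence of this sequence
    return vocab, root

def phrase_count(vocab, root, phrase):
    """Return how many times the word sequence occurs, or 0 if absent."""
    node = root
    for w in phrase:
        if w not in vocab or vocab[w] not in node["children"]:
            return 0
        node = node["children"][vocab[w]]
    return node["count"]

words = "full text index full text search".split()
vocab, root = build_phrase_tree(words)
```

Because sequences that share a beginning share a path, the tree holds every sequence up to the maximum length without copying any word string more than once.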

Each phrase thus derived is assigned a score. The current scoring method is fairly simple: each phrase is given a "strength" equal to the number of "frequent" words that it contains; it is then given a "score" equal to its strength multiplied by one less than the number of times that it occurs in the text, so that a phrase occurring only once scores zero. If a phrase's score is at or above the threshold for a frequent word, the phrase is selected. A slightly different scoring method was used in TexNetF, but the results are similar.
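The scoring rule can be illustrated with a short Python sketch. It reads the multiplier as the number of occurrences minus one, which is an interpretation of the description above. It also assumes that phrases have at least two words and may cross sentence boundaries; the function name score_phrases is hypothetical.

```python
import re
from collections import Counter

def score_phrases(text, frequent, threshold, max_len=10):
    """Return {phrase: score} for phrases scoring at or above threshold."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    # Assumption: a phrase is any run of 2 to max_len consecutive words.
    for n in range(2, max_len + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    selected = {}
    for phrase, occurrences in counts.items():
        strength = sum(1 for w in phrase if w in frequent)
        # Interpretation: strength times (occurrences - 1).
        score = strength * (occurrences - 1)
        if score >= threshold:
            selected[" ".join(phrase)] = score
    return selected

text = "full text index. full text index. full text search."
selected = score_phrases(text, frequent={"full", "text", "index"}, threshold=2)
```

In this example "full text" occurs three times and contains two frequent words, so it scores 2 x 2 = 4, while "text search" occurs once and scores zero.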

The selected phrases are inserted in the "Extract" window in a compact format: no phrase that is a subphrase of any other selected phrase is displayed separately. Each of the remaining longer phrases is displayed in a separate paragraph. You can copy a selected phrase to your abstract by typing Ctrl-O or a single word by typing Ctrl-Space. To indicate which part of a long phrase might be more significant, "frequent" non-stopwords are highlighted in the subwindow.
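The compact display format amounts to discarding any selected phrase that occurs as a contiguous run of words inside a longer selected phrase. A minimal Python sketch, with the hypothetical function name drop_subphrases:

```python
def drop_subphrases(phrases):
    """Keep only phrases that are not contained in a longer phrase."""
    token_lists = [tuple(p.split()) for p in phrases]

    def contains(longer, shorter):
        n = len(shorter)
        return any(longer[i:i + n] == shorter
                   for i in range(len(longer) - n + 1))

    kept = []
    for i, p in enumerate(token_lists):
        is_sub = any(i != j and len(q) > len(p) and contains(q, p)
                     for j, q in enumerate(token_lists))
        if not is_sub:
            kept.append(" ".join(p))
    return kept

kept = drop_subphrases(
    ["full text", "full text index", "text index", "word index"])
```

Matching is done on whole words, so "word index" survives even though it shares the word "index" with a longer phrase.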

In summary, the "Phrase" display option is based on the following principles:

The "Passages" option extracts passages that contain "frequent" non-stopwords in the sequence in which they appear in the original text.

The "Repeats" option extracts passages that contain non-stopwords that are repeated within a certain number of words (currently 32). The passages are extracted in the order in which they first appear in the text.
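A rough Python sketch of the "Repeats" idea follows. It assumes that a passage runs from one occurrence of a non-stopword to its next occurrence and that "within a certain number of words" means the two positions differ by at most that number. The stop-word list and the function name repeat_passages are illustrative.

```python
import re

# Illustrative stop-word list only.
STOP_WORDS = {"the", "a", "is", "of", "and"}

def repeat_passages(text, window=32):
    """Return passages bounded by a non-stopword and its nearby repeat."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    last_seen = {}
    found = []
    for i, w in enumerate(words):
        key = w.lower()
        if key in STOP_WORDS:
            continue
        # Assumption: positions at most `window` apart count as a repeat.
        if key in last_seen and i - last_seen[key] <= window:
            start = last_seen[key]
            found.append((start, " ".join(words[start:i + 1])))
        last_seen[key] = i
    # Order passages by where they begin in the text.
    found.sort(key=lambda pair: pair[0])
    return [passage for _, passage in found]

text = ("The index lists every word. Each index entry is short. "
        "Later a word appears.")
close = repeat_passages(text, window=5)
```

With a window of 5 only the repetition of "index" qualifies; with the default window of 32 the repetition of "word" is picked up as well.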

Sentences

The "Sentences" extraction option provides a simple ranked display of the sentences that appear to be most important on the basis of density of frequent words and positive cues and absence of negative or cohesion cues. After each sentence, its score is included in square brackets. This option will work regardless of whether the text has been properly divided into paragraphs and may be a handy alternative to paragraph extraction for that reason.

If the sentence extract is empty, this is likely because no sentence received a score greater than zero. You can usually remedy this situation by decreasing the frequent word threshold (see the "Parameters" section).
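The general shape of such a ranking can be sketched in Python. The scoring formula here (frequent words plus positive cues, minus negative and cohesion cues, divided by sentence length) is an assumption chosen for illustration, as are the cue lists and the function name rank_sentences; the program's actual weights are not documented in this section.

```python
import re

# Hypothetical cue lists, for illustration only.
POSITIVE_CUES = {"significant", "conclude", "results"}
NEGATIVE_CUES = {"perhaps", "incidentally"}
COHESION_CUES = {"this", "these", "however", "it"}

def rank_sentences(text, frequent):
    """Return sentences with positive scores, best first, score in brackets."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    ranked = []
    for s in sentences:
        words = re.findall(r"[a-z0-9']+", s.lower())
        if not words:
            continue
        raw = sum((w in frequent) + (w in POSITIVE_CUES)
                  - (w in NEGATIVE_CUES) - (w in COHESION_CUES)
                  for w in words)
        score = raw / len(words)  # density rather than a raw count
        if score > 0:  # sentences scoring zero or less are left out
            ranked.append((score, s))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [f"{s} [{score:.2f}]" for score, s in ranked]

text = ("Extracts help abstractors. However, this is unrelated. "
        "Results show extracts are significant.")
ranked = rank_sentences(text, frequent={"extracts", "abstractors"})
```

If the set of frequent words is empty and no positive cues are present, every score is zero or less and the result is an empty list, which corresponds to the empty extract described above.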

Before using the sentence extraction option, you may set up the minimum length of extracts in characters and as a percent of the full text (see the "Parameters" section).


Last updated February 5, 2008, by Tim Craven