ExtPhr32

ExtPhr32 extracts every word and every phrase up to a certain number of words in length that occurs at least a minimum number of times in a source text file and that does not start or end with a stop word.

When run, ExtPhr32 first displays a "Select stoplist" dialog (unless it finds an Extpconf.txt file as described under "Run script..." below). Click on the "OK" button to use the default stoplist Stoplist.txt; or select your own stoplist and click on "OK"; or click on "Cancel" to use no stoplist.

ExtPhr32 now displays its main window.

File menu

Extract from...

Use "File|Extract from..." (Ctrl-E) to analyze a text file. When you select this option, an "Extract from..." dialog will appear. Select the file that you want and click on "OK".

If the minimum occurrences value is less than 2, a "Minimum occurrences < 2" message will appear, warning you that a runtime error may occur and asking whether you wish to proceed. Click on "Yes" or "No" depending on what you want to do.

If the file is longer than 400,000 bytes, a "Long file" message will appear. Click on "Yes" or "No" depending on what you want to do.

As ExtPhr32 analyzes the file, it will show its progress in the first panel of the status bar. When the analysis is complete, an information dialog will appear, showing the value of Simpson's λ (a measure of word repetition) and the total number of word occurrences in the file; click on "OK". ExtPhr32 will now show the occurrences of the more frequent words and phrases in the text file that do not begin or end with stop words.
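As a rough illustration of what a repetition measure of this kind captures, the sketch below computes Simpson's index (often written λ) in its common textbook form: the probability that two word occurrences drawn at random without replacement are the same word. The exact formula that ExtPhr32 uses is not stated here, so the function name and the choice of formula are assumptions.

```python
from collections import Counter

def simpsons_lambda(words):
    """Textbook Simpson's index (an assumption, not necessarily ExtPhr32's formula):
    chance that two distinct occurrences picked at random are the same word."""
    counts = Counter(words)
    total = sum(counts.values())
    if total < 2:
        return 0.0  # fewer than two occurrences: no pair can be drawn
    return sum(n * (n - 1) for n in counts.values()) / (total * (total - 1))
```

A text in which every word occurs once gives 0; a text that repeats a single word throughout gives 1.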

The following extract from an ExtPhr32 display

        CAMPUS 8
        CONTRACTING 5
        EMPLOYEES 5
        FOOD 9
        .SERVICES 7
shows, for example, that the source file contained 9 occurrences of the word "food" and 7 occurrences of the phrase "food services".

Because reasons for capitalization vary in languages using Roman alphabets, ExtPhr32 converts all letters to upper case to provide a standard form for each word or phrase. The rule followed for case conversion depends on the locale or language setting of your machine.
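The kind of extraction described in this section can be sketched in Python as follows. This is a minimal illustration, not ExtPhr32's actual code: the function name, the parameter names, and the simple rule that a word is a run of ASCII letters and digits are all assumptions, and break words and break sets are ignored.

```python
import re
from collections import Counter

def extract_phrases(text, stop_words=(), min_occurrences=2, max_words=3):
    """Count words and phrases of up to max_words words; keep those that occur
    at least min_occurrences times and do not start or end with a stop word."""
    stop = {w.upper() for w in stop_words}
    # Assumption: upper-case everything, then treat runs of A-Z and 0-9 as words.
    words = re.findall(r"[A-Z0-9]+", text.upper())
    counts = Counter()
    for start in range(len(words)):
        if words[start] in stop:
            continue  # phrase may not start with a stop word
        for length in range(1, max_words + 1):
            end = start + length
            if end > len(words):
                break
            if words[end - 1] in stop:
                continue  # phrase may not end with a stop word
            counts[" ".join(words[start:end])] += 1
    return {p: n for p, n in counts.items() if n >= min_occurrences}
```

For example, `extract_phrases("food services on campus food services for employees", stop_words=["on", "for"])` keeps only "FOOD", "SERVICES", and "FOOD SERVICES", each with a count of 2.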

Save as...

Use "File|Save as..." (Ctrl-S) to save the displayed analysis in a file.

Print...

Use "File|Print..." (Ctrl-P) to print the displayed analysis. When you select this option, ExtPhr32 will display a "Print" dialog, to allow you to specify which printer or other device you prefer. Click on "OK" to proceed with printing or "Cancel" if you decide not to print.

Exit

Use "File|Exit" when you are finished with ExtPhr32.

Edit menu

Copy

Before using "Edit|Copy" (Ctrl-C), select the part of the displayed analysis that you want by dragging with the mouse or using Shift with the cursor control keys or by selecting "Edit|Select all". Then use "Edit|Copy" to copy the selected text to the Windows clipboard.

Copy as keywords

This is a variation of "Copy" that strips out the numbers and converts the selected words and phrases into a lower-case list separated by commas, suitable for pasting into the content of an HTML "keywords" meta tag. It is disabled if "Show full phrases" is off.
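The conversion can be pictured with the short Python sketch below. It assumes that each full-phrase line ends with its count and that the list is joined with a comma and a space; the function name is hypothetical and the program's exact output spacing is not documented here.

```python
def lines_to_keywords(lines):
    """Turn full-phrase display lines such as 'FOOD SERVICES 7' into a
    lower-case, comma-separated keyword list (assumed line format)."""
    keywords = []
    for line in lines:
        parts = line.split()
        if parts and parts[-1].isdigit():
            parts = parts[:-1]  # strip the trailing count
        if parts:
            keywords.append(" ".join(parts).lower())
    return ", ".join(keywords)
```

With the lines "FOOD 9" and "FOOD SERVICES 7" selected, this produces the string "food, food services".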

Select all

Use "Edit|Select all" to select all of the displayed analysis.

Collapse

Usable only if full phrase display is selected. This function eliminates all lines for phrases that are parts of longer phrases in other lines.
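A minimal Python sketch of this kind of collapsing follows. It assumes that "part of a longer phrase" means that the words occur contiguously inside the longer phrase, and it works on phrases with their counts already removed; the function name is hypothetical.

```python
def collapse(phrases):
    """Drop every phrase whose words occur contiguously inside a longer phrase."""
    def contains(longer, shorter):
        n = len(shorter)
        return any(longer[i:i + n] == shorter
                   for i in range(len(longer) - n + 1))

    split = {p: p.split() for p in phrases}
    return [
        p for p in phrases
        if not any(len(split[q]) > len(split[p]) and contains(split[q], split[p])
                   for q in phrases)
    ]
```

Given "FOOD", "FOOD SERVICES", and "CAMPUS", the sketch keeps "FOOD SERVICES" and "CAMPUS" and drops "FOOD".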

Sort by frequency

Usable only if full phrase or comparison to expected display is selected. This function reformats all lines to place the frequency first and then sorts them into ascending order of frequency or comparative frequency.
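The reformatting and sorting can be sketched in Python as below. The sketch assumes full-phrase lines that end with a raw integer count and breaks ties alphabetically; both points, and the function name, are assumptions rather than documented behavior.

```python
def sort_by_frequency(lines):
    """Move the count to the front of each line, then sort ascending by count."""
    entries = []
    for line in lines:
        phrase, _, count = line.rpartition(" ")  # count is the last field
        entries.append((int(count), phrase))
    entries.sort()  # ascending by count, then by phrase
    return [f"{count} {phrase}" for count, phrase in entries]
```

For the lines "FOOD 9", "CAMPUS 8", and "FOOD SERVICES 7", the result is "7 FOOD SERVICES", "8 CAMPUS", "9 FOOD".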

Extract from clipboard

This function works like "Extract from..." in the "File" menu, except that it analyzes the text contents of the Windows clipboard instead of a file that you specify.

Options menu

Minimum occurrences

Use "Options|Minimum occurrences..." (Ctrl-M) to reset the minimum number of occurrences for a word or phrase to be included in the display. When you select this option, a "Minimum occurrences" dialog will appear. Type in the number that you want and click on "OK".

Maximum words in phrase

Use "Options|Maximum words in phrase..." to reset the maximum length of a phrase to a value in the range 1 to 20. The higher this number is, the more memory the program will require to run.

Stoplist...

Use "Options|Stoplist..." (Ctrl-L) to open a "Load stoplist" dialog and choose a different stoplist. If you click on "Cancel", the previous stoplist will be retained.

Note that ExtPhr32 does not apply a newly selected stoplist until you extract from a new file or change the minimum occurrence value.

Break words...

Use "Options|Break words..." to open a "Load break words" dialog and choose a list of break words (stopwords across which phrases are not generated).

Note that ExtPhr32 does not apply newly selected break words until you extract from a new file. (Changing the minimum occurrence value does not cause any new break words to be applied.)

Break words do not need to be included in the stoplist.

Expectation file...

Use this item to specify a text file containing words and expected relative frequencies for use with the "comparison to expected" display. The file should be in alphabetical order and each line should consist of a word in uppercase, a tab, and an expected relative frequency expressed as a decimal fraction of all word occurrences.

You can create such a file automatically with ExtPhr32 by doing the following:

  1. create or obtain a large representative text file;
  2. select "Extract from..." from the "File" menu and extract from this file;
  3. check "Show one-word only" and "Show relative frequency" in the "Options" menu;
  4. select "Save as..." from the "File" menu and save the expectation file.
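If you prefer to build an expectation file outside ExtPhr32, the file layout described above (uppercase word, tab, decimal fraction, alphabetical order) can be produced with a sketch like the following. The function name, the six decimal places, and the use of plain code-point sorting for "alphabetical order" are assumptions.

```python
from collections import Counter

def expectation_lines(words):
    """Build expectation-file lines: WORD<tab>relative frequency, sorted by word."""
    counts = Counter(w.upper() for w in words)
    total = sum(counts.values())
    return [f"{word}\t{counts[word] / total:.6f}" for word in sorted(counts)]
```

For the words "food", "Food", "campus", and "services", this yields lines for CAMPUS (0.250000), FOOD (0.500000), and SERVICES (0.250000).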

Break set file...

Use this item to load or cancel a list of decimal ASCII codes across which phrases are not generated. (Any ASCII codes for characters that are recognized as parts of words will be ignored.) The file containing the list should be a plain text file with one code to a line. For example, the following list specifies that no phrases will be generated across carriage returns, exclamation points, commas, periods, colons, semicolons, or question marks:
13
33
44
46
58
59
63
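To check which characters a break set file refers to, a small Python sketch such as the following can decode the codes. The function and variable names are hypothetical; the list is the example given above.

```python
BREAK_SET_TEXT = """13
33
44
46
58
59
63
"""

def parse_break_set(text):
    """Read one decimal character code per line and return the characters."""
    return {chr(int(line)) for line in text.splitlines() if line.strip()}

break_chars = parse_break_set(BREAK_SET_TEXT)
```

Here `break_chars` contains the carriage return together with the characters ! , . : ; and ?.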

Additional letter set file...

Use this item to load or cancel a list of additional decimal ASCII codes to be recognized as characters in words. For example, the following list provides for additional (uppercase) Polish characters in ISO-8859-2 code:
143
161
163
175

Allow initial numerals

If this menu item is unchecked, all words beginning with any of the numerals 0-9 will be treated as stopwords.

Include extended ASCII

Clicking on this menu item toggles off and on the recognition as characters in words of letters from Western European languages with ASCII codes greater than 127 (such as Ö and é). The standard followed is the more universal ISO 8859-1, not the Extended ASCII used in MS-DOS and still represented in the Windows Terminal font.

Show one-word only

Clicking on this menu item toggles on and off a display giving data only on single word occurrences, excluding multi-word phrases.

Show full phrases

Clicking on this menu item toggles on and off a display in which each phrase is shown in full instead of in the default hierarchical style.

Show multiword only

Clicking on this menu item (available if "Show full phrases" is on) toggles on and off a display giving data only on phrases of two or more words, excluding data on single word occurrences.

Lower case

Clicking on this menu item toggles on and off display in lower case rather than upper case.

Show relative frequency

Clicking on this menu item toggles on and off a display giving word and phrase frequencies as proportions of all word or phrase occurrences rather than as raw counts.

Show comparison to expected

Clicking on this menu item toggles on and off a display giving, in place of word frequencies, values derived by comparing the observed frequencies with expectations in the expectation file. In this display, only single words are shown and only those that are at least as frequent as expected.

Font

This item allows you to select the font used to display extraction results, including the character set. For a good variety of character sets (Western, Arabic, Hebrew, Greek, Baltic, Turkish, Central European, and Cyrillic), try selecting the Arial font.

Tools menu

Compare

This item allows you to compare two files containing analyses saved in either hierarchical or full-phrase style, sending common parts to a third file. You will be prompted for the files to use. The hierarchical output looks like the following:
UWO 5+5
.CA 5+5
(Do not use this item on two files using different styles.)
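The idea of the comparison can be sketched in Python for two analyses in full-phrase style. The sketch assumes that each line ends with a count and that the "5+5" notation pairs the count from the first file with the count from the second; neither assumption is confirmed here, and the function name is hypothetical.

```python
def compare(lines_a, lines_b):
    """Return entries present in both analyses, written as 'PHRASE a+b'."""
    def parse(lines):
        # 'UWO CA 5' -> ('UWO CA', '5')
        return dict(line.rsplit(" ", 1) for line in lines)

    a, b = parse(lines_a), parse(lines_b)
    return [f"{phrase} {a[phrase]}+{b[phrase]}" for phrase in a if phrase in b]
```

Comparing "UWO 5", "UWO CA 5", "FOOD 3" with "UWO 5", "UWO CA 5" gives "UWO 5+5" and "UWO CA 5+5".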

Run script...

This item allows you to run a script file consisting of a sequence of command lines. The following commands are currently available:
additionalletterset pathname
breakset pathname
breakwords pathname
collapse
includeextendedascii off
includeextendedascii on
includenumerals off
includenumerals on
expectationfile pathname
extractfrom pathname
font name font name
font size n
font style [bold] [italic] [underline] [strikeout]
font color n
font charset n
lowercase off
lowercase on
maximumwordsinphrase n
minimumoccurrences n
saveas pathname
showcomparisontoexpected off
showcomparisontoexpected on
showfullphrases off
showfullphrases on
showonewordonly off
showonewordonly on
showmultiwordonly off
showmultiwordonly on
showrelativefrequency off
showrelativefrequency on
sortbyfrequency
stoplist pathname
Depending on other font properties, charset values that might be used include 0 (ANSI), 1 (default), 77 (Macintosh), 161 (Greek), 162 (Turkish), 177 (Hebrew), 178 (Arabic), 186 (Baltic), 204 (Cyrillic), 238 (Eastern European), 255 (OEM).

Color numbers may be given in hexadecimal form (e.g., $FF0000 for blue).
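The example above ($FF0000 for blue) suggests that the hexadecimal digits are ordered blue, green, red. On that assumption, a color number can be split into its components with a Python sketch like this one; the function name is hypothetical.

```python
def parse_color(value):
    """Split a '$BBGGRR' color number into (red, green, blue).
    The byte order is inferred from the '$FF0000 is blue' example."""
    number = int(value.lstrip("$"), 16)
    red = number & 0xFF
    green = (number >> 8) & 0xFF
    blue = (number >> 16) & 0xFF
    return red, green, blue
```

Under this reading, "$FF0000" is pure blue and "$0000FF" is pure red.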

When a script is run, disabled or erroneous commands are simply discarded and no error messages are generated.

On startup, the program looks in its folder for a script file Extpconf.txt; if it finds this file, it executes the script. This is useful for setting options that you may prefer as the defaults for your installation. If ExtPhr32 finds Extpconf.txt, it does not prompt you automatically for a stoplist; Extpconf.txt should therefore contain at least a stoplist command.
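As an illustration, the Python sketch below assembles a small Extpconf.txt using only commands from the list above. The folder and stoplist path are hypothetical, the option values are arbitrary, and whether pathnames containing spaces need special treatment is not covered here.

```python
def build_config(stoplist_path, min_occurrences=2, max_words=3):
    """Return the text of a startup script, one command per line."""
    commands = [
        f"stoplist {stoplist_path}",
        f"minimumoccurrences {min_occurrences}",
        f"maximumwordsinphrase {max_words}",
        "showfullphrases on",
        "lowercase on",
    ]
    return "\n".join(commands) + "\n"

def write_config(path, text):
    """Write the script text to the given file."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(text)

# Hypothetical stoplist location; adjust to your installation.
script = build_config(r"C:\ExtPhr32\Stoplist.txt")
```

Calling `write_config` with the path of Extpconf.txt in the program's folder would then save the script for use at the next startup.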

Similarity

This item allows you to calculate the similarity of vocabulary of two files. The files are analyzed using the current stoplist and word character settings. The result is displayed in a message box, and the main display contents are not changed.
Last updated February 5, 2008, by Tim Craven