ExtPhr32
ExtPhr32 extracts every word and every phrase
up to a certain number of words in length
that occurs at least a minimum number of times
in a source text file
and that does not start or end with a stop word.
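The selection rule above can be illustrated with a short Python sketch.
It is not ExtPhr32's actual code:
the function name extract_phrases, its parameters,
and the simple A-Z tokenization are assumptions made for illustration.

```python
import re
from collections import Counter

def extract_phrases(text, stop_words=frozenset(), min_occurrences=2, max_words=3):
    """Count words and phrases of up to max_words words.

    Illustrative only: splitting on runs of A-Z is an assumption,
    not necessarily how ExtPhr32 recognizes words.
    """
    words = re.findall(r"[A-Z]+", text.upper())
    stops = {w.upper() for w in stop_words}
    counts = Counter()
    for start in range(len(words)):
        for length in range(1, max_words + 1):
            phrase = words[start:start + length]
            if len(phrase) < length:
                break  # ran off the end of the text
            # Skip candidates that start or end with a stop word.
            if phrase[0] in stops or phrase[-1] in stops:
                continue
            counts[" ".join(phrase)] += 1
    # Keep only items that occur often enough.
    return {p: n for p, n in counts.items() if n >= min_occurrences}

sample = "Food services on campus. The food services staff. Food on campus."
result = extract_phrases(sample, stop_words={"on", "the"}, min_occurrences=2)
for phrase in sorted(result):
    print(phrase, result[phrase])
```

Note that this sketch lets phrases run across sentence boundaries;
it does not model break words or break sets.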
When run,
ExtPhr32 first displays a "Select stoplist" dialog
(unless it finds an Extpconf.txt file
as described under "Run script..." below).
Click on the "OK" button
to use the default stoplist
Stoplist.txt;
or select your own stoplist and click on "OK";
or click on "Cancel" to use no stoplist.
ExtPhr32 now displays its main window.
File menu
Extract from...
Use "File|Extract from..." (Ctrl-E)
to analyze a text file.
When you select this option,
an "Extract from..." dialog will appear.
Select the file that you want and click on "OK".
If the minimum occurrences value is less than 2,
a "Minimum occurrences < 2" message will appear,
warning you that a runtime error may occur
and asking whether you wish to proceed.
Click on "Yes" to proceed
or "No" if you do not wish to.
If the file is longer than 400,000 bytes,
a "Long file" message will appear.
Click on "Yes" or "No"
depending on what you want to do.
As ExtPhr32 analyzes the file,
it will show its progress
in the first panel of the status bar.
When the analysis is complete,
an information dialog will appear,
showing the value of Simpson's λ
(a measure of word repetition)
and the total number of word occurrences in the file;
click on "OK".
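As a rough illustration of what a repetition measure of this kind computes,
the following Python sketch uses a common form of Simpson's index:
the probability that two word occurrences drawn at random
without replacement are the same word.
Which variant ExtPhr32 actually uses is not stated here,
so the formula and the function name simpsons_index are assumptions.

```python
from collections import Counter

def simpsons_index(words):
    """Probability that two randomly chosen occurrences are the same word."""
    counts = Counter(words)
    total = sum(counts.values())
    if total < 2:
        raise ValueError("need at least two word occurrences")
    same_pairs = sum(n * (n - 1) for n in counts.values())
    return same_pairs / (total * (total - 1))

print(simpsons_index("food services food campus".split()))
```

A text in which every word is different gives 0;
a text that repeats a single word gives 1.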
ExtPhr32 will now show the occurrences
of the more frequent words and phrases in the text file
that do not begin or end with stop words.
The following extract from an ExtPhr32 display
CAMPUS 8
CONTRACTING 5
EMPLOYEES 5
FOOD 9
.SERVICES 7
shows, for example,
that the source file contained 9 occurrences
of the word "food"
and 7 occurrences of the phrase "food services".
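The hierarchical style can be read mechanically.
The Python sketch below expands each dotted line into its full phrase,
assuming that one leading dot means one level of nesting
under the nearest preceding line with fewer dots.
The extract above shows only one level,
so deeper nesting and the function name expand_hierarchy are assumptions.

```python
def expand_hierarchy(lines):
    """Turn dotted hierarchical lines into (full phrase, count) pairs."""
    stack = []   # words of the phrase currently being built
    result = []
    for line in lines:
        text, count = line.rsplit(None, 1)
        depth = len(text) - len(text.lstrip("."))
        word = text.lstrip(".")
        # Keep the parent words down to this depth, then add this word.
        stack = stack[:depth] + [word]
        result.append((" ".join(stack), int(count)))
    return result

display = ["CAMPUS 8", "CONTRACTING 5", "EMPLOYEES 5", "FOOD 9", ".SERVICES 7"]
for phrase, count in expand_hierarchy(display):
    print(phrase, count)
```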
Because reasons for capitalization vary in languages using Roman alphabets,
ExtPhr32 converts all letters to upper case
to provide a standard form for each word or phrase.
The rule followed for case conversion
depends on the locale or language setting of your machine.
Save as...
Use "File|Save as..." (Ctrl-S)
to save the displayed analysis in a file.
Print...
Use "File|Print..." (Ctrl-P)
to print the displayed analysis.
When you select this option,
ExtPhr32 will display a "Print" dialog,
to allow you to specify which printer or other device
you prefer.
Click on "OK" to proceed with printing
or "Cancel" if you decide not to print.
Exit
Use "File|Exit"
when you are finished with ExtPhr32.
Edit menu
Copy
Before using "Edit|Copy" (Ctrl-C),
select the part of the displayed analysis that you want
by dragging with the mouse
or using Shift with the cursor control keys
or by selecting "Edit|Select all".
Then use "Edit|Copy"
to copy the selected text to the Windows clipboard.
Copy as keywords
This is a variation of "Copy"
that strips out the numbers
and converts the words and phrases selected
into a lower-case list separated by commas,
suitable for pasting into the content
of an HTML "keywords" meta tag.
It is disabled if "Show full phrases" is off.
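The transformation can be pictured with the following Python sketch.
It assumes that each full-phrase line ends with its count
after whitespace, and it joins items with a comma and a space;
both details, and the function name to_keywords, are assumptions.

```python
def to_keywords(lines):
    """Drop the counts and return a lower-case, comma-separated list."""
    phrases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        phrase, _count = line.rsplit(None, 1)
        phrases.append(phrase.lower())
    return ", ".join(phrases)

print(to_keywords(["FOOD 9", "FOOD SERVICES 7"]))
```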
Select all
Use "Edit|Select all"
to select all of the displayed analysis.
Collapse
Usable only if full phrase display is selected,
this function eliminates all lines for phrases
that are parts of longer phrases in other lines.
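A minimal Python sketch of this kind of collapsing follows.
It treats a phrase as "part of" a longer phrase
when its words occur as a consecutive run inside the longer one,
and it ignores the counts;
whether ExtPhr32 applies exactly this test is an assumption,
as is the function name collapse.

```python
def collapse(entries):
    """entries maps phrase -> count; drop phrases contained in longer ones."""
    def contains(longer, shorter):
        n = len(shorter)
        return any(longer[i:i + n] == shorter
                   for i in range(len(longer) - n + 1))

    split = {phrase: phrase.split() for phrase in entries}
    kept = {}
    for phrase, words in split.items():
        covered = any(len(other) > len(words) and contains(other, words)
                      for other in split.values())
        if not covered:
            kept[phrase] = entries[phrase]
    return kept

print(collapse({"FOOD": 9, "SERVICES": 7, "FOOD SERVICES": 7, "CAMPUS": 8}))
```

Comparison is by whole words,
so "FOOD" would not be removed merely because "SEAFOOD" appears elsewhere.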
Sort by frequency
Usable only if full phrase or comparison to expected display is selected,
this function reformats all lines to place the frequency first
and then sorts them into ascending order of frequency
or comparative frequency.
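The effect on a full-phrase display can be sketched in Python as follows.
The line format (count last, separated by whitespace),
the alphabetical tie-breaking,
and the function name sort_by_frequency are assumptions.

```python
def sort_by_frequency(lines):
    """Move each count to the front of its line and sort ascending."""
    parsed = []
    for line in lines:
        phrase, count = line.rsplit(None, 1)
        parsed.append((int(count), phrase))
    parsed.sort()  # by count first, then alphabetically by phrase
    return [f"{count} {phrase}" for count, phrase in parsed]

for line in sort_by_frequency(["CAMPUS 8", "FOOD 9", "FOOD SERVICES 7"]):
    print(line)
```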
Extract from clipboard
This function works like "Extract from..."
in the "File" menu,
except that it analyzes the text contents of the Windows clipboard
instead of a file that you specify.
Options menu
Minimum occurrences
Use "Options|Minimum occurrences..." (Ctrl-M)
to reset the minimum number of occurrences
for a word or phrase to be included in the display.
When you select this option,
a "Minimum occurrences" dialog will appear.
Type in the number that you want
and click on "OK".
Maximum words in phrase
Use "Options|Maximum words in phrase..."
to reset the maximum length of a phrase
to a value in the range 1 to 20.
The higher this number is,
the more memory the program will require to run.
Stoplist...
Use "Options|Stoplist..." (Ctrl-L)
to open a "Load stoplist" dialog
and choose a different stoplist.
If you click on "Cancel",
the previous stoplist will be retained.
Note that ExtPhr32 does not apply a newly selected stoplist
until you extract from a new file
or change the minimum occurrence value.
Break words...
Use "Options|Break words..."
to open a "Load break words" dialog
and choose a list of break words
(stopwords across which phrases are not generated).
Note that ExtPhr32 does not apply newly selected break words
until you extract from a new file.
(Changing the minimum occurrence value
does not cause any new break words to be applied.)
Break words do not need to be included in the stoplist.
Expectation file...
Use this item
to specify a text file containing words and expected relative frequencies
for use with the "comparison to expected" display.
The file should be in alphabetical order
and each line should consist of a word in uppercase,
a tab, and an expected relative frequency
expressed as a decimal fraction of all word occurrences.
You can create such a file automatically with ExtPhr32
by doing the following:
- create or obtain a large representative text file;
- select "Extract from..."
from the "File" menu;
and extract from this file;
- check "Show one-word only"
and "Show relative frequency"
in the "Options" menu;
- select "Save as..."
from the "File" menu
and save the expectation file.
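For readers who want to check or build an expectation file by other means,
the following Python sketch produces lines in the format described:
an uppercase word, a tab, and a decimal fraction, in alphabetical order.
The A-Z tokenization, the six decimal places,
and the function name expectation_lines are assumptions;
the sketch also applies no stoplist.

```python
import re
from collections import Counter

def expectation_lines(text):
    """Return 'WORD<tab>relative frequency' lines in alphabetical order."""
    words = re.findall(r"[A-Z]+", text.upper())
    counts = Counter(words)
    total = sum(counts.values())
    return [f"{word}\t{counts[word] / total:.6f}" for word in sorted(counts)]

for line in expectation_lines("the cat saw the dog"):
    print(line)
```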
Break set file...
Use this item to load or cancel a list of decimal ASCII codes
across which phrases are not generated.
(Any ASCII codes for characters that are recognized as parts of words
will be ignored.)
The file containing the list
should be a plain text file
with one code to a line.
For example,
the following list specifies that no phrases will be generated
across carriage returns, exclamation points, commas, periods, colons, semicolons, or question marks:
13
33
44
46
58
59
63
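To see what such a list does,
the Python sketch below turns the codes into characters
and splits a text at each of them,
so that phrases could only be formed inside one segment.
The function name split_on_break_set is hypothetical,
and the sketch only approximates how ExtPhr32 applies a break set.

```python
def split_on_break_set(text, codes):
    """Split text at every character whose code is in the break set."""
    breaks = {chr(code) for code in codes}
    segments = []
    current = []
    for ch in text:
        if ch in breaks:
            segments.append("".join(current))
            current = []
        else:
            current.append(ch)
    segments.append("".join(current))
    # Drop empty segments and surrounding blanks.
    return [s.strip() for s in segments if s.strip()]

codes = [13, 33, 44, 46, 58, 59, 63]
print([chr(code) for code in codes])
print(split_on_break_set("Food services, on campus. Why not?", codes))
```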
Additional letter set file...
Use this item to load or cancel a list of additional decimal ASCII codes
to be recognized as characters in words.
For example,
the following list provides for additional (uppercase) Polish characters
in ISO-8859-2 code:
143
161
163
175
(Here are the actual characters:
.
If your browser character set is ISO-8859-2,
you should see them as Polish letters.
If not,
you may see them as various mixtures of boxes,
Cyrillic characters, and other symbols,
or not at all.)
Allow initial numerals
If this menu item is unchecked,
all words beginning with any of the numerals 0-9
will be treated as stopwords.
Include extended ASCII
Clicking on this menu item
toggles whether letters from Western European languages
with character codes greater than 127
(such as Ö and é)
are recognized as characters in words.
The standard followed is the more universal ISO 8859-1,
not the Extended ASCII used in MS-DOS
and still represented in the Windows Terminal font.
Show one-word only
Clicking on this menu item
toggles on and off a display
giving data only on single word occurrences,
excluding multi-word phrases.
Show full phrases
Clicking on this menu item
toggles on and off a display in which each phrase
is shown in full
instead of in the default hierarchical style.
Show multiword only
Clicking on this menu item
(available if "Show full phrases" is on)
toggles on and off a display
giving data only on phrases of two or more words,
excluding data on single word occurrences.
Lower case
Clicking on this menu item
toggles on and off display in lower case
rather than upper case.
Show relative frequency
Clicking on this menu item
toggles on and off a display
giving word and phrase frequencies as proportions
of all word or phrase occurrences
rather than as raw counts.
Show comparison to expected
Clicking on this menu item
toggles on and off a display
giving, in place of word frequencies,
values derived by comparing the observed frequencies
with expectations in the expectation file.
In this display,
only single words are shown
and only those that are at least as frequent as expected.
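The exact values that ExtPhr32 derives are not specified here.
As one possible reading,
the Python sketch below divides each word's observed relative frequency
by its expected relative frequency
and keeps only words whose ratio is at least 1.
The ratio, the skipping of words missing from the expectation data,
and the function name compare_to_expected are all assumptions.

```python
def compare_to_expected(observed_counts, expected_freq):
    """Return word -> observed/expected ratio for words at or above expectation."""
    total = sum(observed_counts.values())
    result = {}
    for word, count in observed_counts.items():
        expected = expected_freq.get(word)
        if not expected:
            continue  # treatment of unlisted words is an assumption
        ratio = (count / total) / expected
        if ratio >= 1.0:
            result[word] = ratio
    return result

observed = {"FOOD": 9, "THE": 1}
expected = {"FOOD": 0.3, "THE": 0.5}
print(compare_to_expected(observed, expected))
```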
Font
This item allows you to select the font used
to display extraction results,
including the character set.
For a good variety of character sets
(Western, Arabic, Hebrew, Greek, Baltic, Turkish, Central European,
and Cyrillic),
try selecting the Arial font.
Tools menu
Compare
This item allows you to compare two files
containing analyses saved in either hierarchical or full-phrase style,
sending common parts to a third file.
You will be prompted for the following:
- input file 1,
- input file 2,
- output file.
The hierarchical output looks like the following:
UWO 5+5
.CA 5+5
(Do not use this item on two files using different styles.)
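For analyses saved in full-phrase style,
the comparison can be approximated by the Python sketch below.
It reads the "5+5" in the sample output
as the count from file 1 followed by the count from file 2,
which is an inference from that sample;
the function names parse_analysis and compare are hypothetical.

```python
def parse_analysis(lines):
    """Read 'PHRASE count' lines into a dict."""
    entries = {}
    for line in lines:
        if line.strip():
            phrase, count = line.rsplit(None, 1)
            entries[phrase] = int(count)
    return entries

def compare(lines1, lines2):
    """Return lines for phrases present in both analyses."""
    first = parse_analysis(lines1)
    second = parse_analysis(lines2)
    return [f"{p} {first[p]}+{second[p]}" for p in sorted(first) if p in second]

file1 = ["UWO 5", "UWO CA 5", "LIBRARY 3"]
file2 = ["UWO 5", "UWO CA 5", "FOOD 2"]
for line in compare(file1, file2):
    print(line)
```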
Run script...
This item allows you to run a script file
consisting of a sequence of command lines.
The following commands are currently available:
additionalletterset pathname
breakset pathname
breakwords pathname
collapse
includeextendedascii off
includeextendedascii on
includenumerals off
includenumerals on
expectationfile pathname
extractfrom pathname
font name font name
font size n
font style [bold] [italic] [underline] [strikeout]
font color n
font charset n
lowercase off
lowercase on
maximumwordsinphrase n
minimumoccurrences n
saveas pathname
showcomparisontoexpected off
showcomparisontoexpected on
showfullphrases off
showfullphrases on
showonewordonly off
showonewordonly on
showmultiwordonly off
showmultiwordonly on
showrelativefrequency off
showrelativefrequency on
sortbyfrequency
stoplist pathname
Depending on other font properties,
charset values that might be used include
0 (ANSI), 1 (default), 77 (Macintosh),
161 (Greek), 162 (Turkish), 177 (Hebrew), 178 (Arabic),
186 (Baltic), 204 (Cyrillic), 238 (Eastern European), 255 (OEM).
Color numbers may be given in hexadecimal form
(e.g., $FF0000 for blue).
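The example implies that the blue component comes first
and the red component last,
as in a Windows color value of the form $BBGGRR.
Under that assumption,
a color number can be built from red, green, and blue values
with a small Python helper such as the following
(the function name color_number is hypothetical).

```python
def color_number(red, green, blue):
    """Format an RGB triple as a $BBGGRR hexadecimal color number."""
    for value in (red, green, blue):
        if not 0 <= value <= 255:
            raise ValueError("each component must be in the range 0 to 255")
    return f"${blue:02X}{green:02X}{red:02X}"

print(color_number(0, 0, 255))
print(color_number(255, 0, 0))
```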
When a script is run,
disabled or erroneous commands are simply discarded
and no error messages are generated.
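Because mistakes are discarded silently,
it can help to check a script before running it.
The Python sketch below reports any line
whose first word is not one of the command names listed above.
It checks only the command name, not the arguments,
and it assumes that command names must be written in lower case as listed;
the function name unknown_commands is hypothetical.

```python
KNOWN_COMMANDS = {
    "additionalletterset", "breakset", "breakwords", "collapse",
    "includeextendedascii", "includenumerals", "expectationfile",
    "extractfrom", "font", "lowercase", "maximumwordsinphrase",
    "minimumoccurrences", "saveas", "showcomparisontoexpected",
    "showfullphrases", "showonewordonly", "showmultiwordonly",
    "showrelativefrequency", "sortbyfrequency", "stoplist",
}

def unknown_commands(script_text):
    """Return (line number, word) for each unrecognized command name."""
    problems = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        words = line.split()
        if words and words[0] not in KNOWN_COMMANDS:
            problems.append((number, words[0]))
    return problems

script = "stoplist Stoplist.txt\nminimumoccurences 3\nshowfullphrases on\n"
print(unknown_commands(script))
```

In this example the misspelled second line is reported;
ExtPhr32 itself would simply skip it.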
On startup,
the program looks in its folder
for a script file Extpconf.txt;
if it finds this file,
it executes the script.
This is useful for setting options
that you may prefer as the defaults
for your installation.
If ExtPhr32 finds Extpconf.txt,
it does not prompt you automatically for a stoplist;
Extpconf.txt should therefore contain at least
a stoplist command.
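A startup script of this kind can be written by hand in any text editor.
Purely as an illustration,
the Python sketch below assembles one from a few settings,
using only commands from the list above and putting the stoplist command first.
The function name startup_script, its parameters,
and the example pathname are assumptions;
how ExtPhr32 handles pathnames containing spaces is not covered here.

```python
def startup_script(stoplist_path, minimum_occurrences=2, maximum_words=3,
                   full_phrases=False):
    """Return the text of a small Extpconf.txt-style script."""
    if not 1 <= maximum_words <= 20:
        raise ValueError("maximum words in phrase must be in the range 1 to 20")
    lines = [
        f"stoplist {stoplist_path}",
        f"minimumoccurrences {minimum_occurrences}",
        f"maximumwordsinphrase {maximum_words}",
        f"showfullphrases {'on' if full_phrases else 'off'}",
    ]
    return "\n".join(lines) + "\n"

print(startup_script("Stoplist.txt", minimum_occurrences=3,
                     maximum_words=4, full_phrases=True))
```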
Similarity
This item allows you to calculate the similarity of vocabulary
of two files.
The files are analyzed using the current stoplist
and word character settings.
The result is displayed in a message box,
and the main display contents are not changed.
Last updated February 5, 2008, by
Tim Craven