Data were logged from pages obtained via Yahoo!'s random page service and the Google directory; an img tag was extracted randomly from each page where present; its alt attribute, if any, was recorded; and the header for the corresponding image file was retrieved if possible. About 17% of pages showed no img elements. Of img tags randomly selected from the remainder, about 48% had alt texts, of which about 27% were null. Of the images for which headers could be retrieved, about 73% were GIF, 24% JPEG, and 1% PNG. GIF images were more commonly assigned null alt texts than JPEG images, and GIF files tended to be shorter than JPEG files. Alt texts for images from pages containing more images tended to be slightly shorter. Possible explanations for the results included GIF files' being more suited to decorative images and the likelihood that many images on image-rich pages were content-poor.
Web pages returned by Yahoo!'s random page service showed generally no significant difference in inclusion of meta tagged descriptions and keywords between pages where a generator was identified and other pages. About 56% had descriptions and 58% keywords. An exception for descriptions was pages created with Yahoo! PageBuilder, all found in the geocities.com domain. Examination of a further sample of URLs restricted to geocities.com showed a significant difference in inclusion of both keywords and descriptions between pages where a generator was identified (mostly Yahoo! PageBuilder or Microsoft FrontPage) and pages lacking generator identification (39% versus 61% for keywords and 28% versus 54% for descriptions). Exact repetition of descriptions or keywords between pages on the same site did not generally correlate significantly with identified generators.
Web pages cited with personal author identification in 12 longer Web bibliographies and a collection of 19 shorter Web bibliographies were investigated. With one exception, the personal author names could be matched in the visible text of the great majority of pages. Meta tags (both for authors and for descriptions) and page titles rarely added any author information. In some cases, frames or inline graphics appeared to be the sources used. Somewhat more frequent probable sources were linked pages, such as home pages.
Using the Yahoo! and Google directories, sets of pages from the top levels of each major subject area were downloaded and analyzed for presence of meta tag descriptions, lengths of descriptions, and degree of match in wording of descriptions to the pages' displayed texts and titles. Results for both directories showed significant differences in proportion of pages with descriptions and in lengths of descriptions depending on subject area; specifically, both health categories showed higher proportions with descriptions.
Sets of top-ranking pages in 19 languages returned by the Google search engine were downloaded and their titles and meta tagged keywords analyzed. Results showed significant differences in proportion of pages with keywords depending on language; specifically, pages in Dutch, French, and German showed the highest proportions with keywords, while pages in Chinese and Korean showed the lowest proportions. Keywords were mostly in the languages of the pages, though on Chinese, Greek, Indonesian, and Turkish pages keywords in English or in English mixed with other languages predominated. The proportion of very long titles also varied significantly with language, with nearly 10% of titles on Russian pages exceeding 100 bytes, in contrast to less than 1% on Chinese, Finnish, Indonesian, and Polish pages. Both standard ASCII extensions and character entity references were used to code special characters in titles.
Sets of top-ranking pages in 20 languages returned by the Google search engine were downloaded and analyzed for presence of meta tag descriptions and lengths of descriptions. Results showed significant differences in proportion of pages with descriptions and in lengths of descriptions depending on language; specifically, pages in major Western European languages showed higher proportions with descriptions, while pages in Chinese showed the lowest proportions. Descriptions were mostly in the languages of the pages, though English descriptions were provided on some non-English pages. With few exceptions, coding schemes adopted for diacritics and non-Roman characters were standard.
Using four previously identified samples of Web pages containing meta tag descriptions, the value of meta tag keywords, the first 200 characters of the body, and text marked with common HTML tags as extracts helpful for writing summaries was estimated by applying two measures: density of description words and density of two-word description phrases. Generally, titles and keywords showed the highest densities. Parts of the body showed densities not much different from the body as a whole: somewhat higher for the first 200 characters and for text tagged with CENTER and FONT; somewhat lower for text tagged with A; not significantly different for TABLE and DIV. Implications of the findings for aids to summarization, and specifically the TexNet32 package, were considered.
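The two density measures named above can be sketched roughly as follows. The tokenization, case handling, and exact normalization are assumptions for illustration, not details taken from the study: word density is read here as the proportion of extract tokens that also occur in the description, and phrase density as the proportion of adjacent word pairs in the extract that also occur as adjacent pairs in the description.

```python
def word_density(extract_words, description_words):
    """Proportion of extract tokens that also occur in the description.
    (Illustrative reading of the 'density of description words' measure.)"""
    if not extract_words:
        return 0.0
    desc = set(description_words)
    return sum(1 for w in extract_words if w in desc) / len(extract_words)

def phrase_density(extract_words, description_words):
    """Proportion of adjacent word pairs in the extract that also occur
    as adjacent pairs in the description.  (Illustrative reading of the
    'density of two-word description phrases' measure.)"""
    pairs = list(zip(extract_words, extract_words[1:]))
    if not pairs:
        return 0.0
    desc_pairs = set(zip(description_words, description_words[1:]))
    return sum(1 for p in pairs if p in desc_pairs) / len(pairs)
```

On this reading, a candidate extract scores higher the more of its wording reappears in the meta tag description, which is what makes it a plausible proxy for usefulness to a human summarizer.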
Fifteen sets of external descriptions of Web pages were examined for common phrases, general syntactic structure, and content. For the seven largest sets, the value, as extracts helpful for writing external descriptions, of meta tag descriptions and keywords, the first 200 characters of the body, and text marked with common HTML tags was estimated by applying two measures: density of external description words and density of two-word external description phrases. Syntactic patterns were found to vary between sets, with larger sets tending to be more internally consistent. Generally, titles showed the highest match densities; match densities were also generally high for meta tag descriptions and for the first 200 characters of the body, and low for text tagged A, with mixed results for keywords and for text tagged B, CENTER, or FONT.
Sixteen Web bibliographies were analyzed for uses of two different recommended sources: (1) the tagged title; (2) the title as it would appear to be from viewing the beginning of the page in the browser (apparent title). In all sixteen, the proportion of tagged titles was much less than that of apparent titles, and only rarely did the bibliography title match the tagged title and not the apparent title. Convenience of copying might partly explain the preference for the apparent title.
Fifteen user queries from AskJeeves were submitted to Excite, Go, HotBot, and Yahoo!, and the top-ranking pages were downloaded and examined for the presence of META tags, especially the DESCRIPTION tag. Results were generally similar to those in previous studies using the Yahoo! random-page service, suggesting that various sampling methods may be used for this sort of study. Go returned significantly fewer pages with META tags than Excite and HotBot, and Yahoo! returned significantly fewer than HotBot. Go and Yahoo! both returned significantly fewer pages with the DESCRIPTION tag than did Excite, even though Go reported using the tag and Excite claimed to ignore it.
Four sets of previously visited Web pages were revisited one year later. About 74% of pages previously containing meta tag descriptions retained descriptions, and about 8% of pages previously lacking descriptions gained descriptions. Home pages appeared to both lose and change descriptions more than other pages. About two-thirds of changes involved minor revisions, and changes fell into a wide variety of categories.
Random samples of Web pages registered with Yahoo! and pages reachable from Yahoo!-registered pages were analyzed for use of META tags and specifically those containing descriptions; about 39% of the Yahoo!-registered pages and 27% of the other pages included descriptions in META tags. Some of the descriptions greatly exceeded typical length guidelines of 150 or 200 characters. A minority exactly duplicated phrasing found in the visible text; most repeated some words and phrases. Contrary to advice, pages with less visible text were less likely to have descriptions. Keywords were somewhat more likely to appear nearer the beginning of a description than nearer the end. Noun phrases were more common than complete sentences, especially in the non-registered pages.
To determine patterns of relationships among descriptions on the same site, links were followed automatically from pages previously found to contain descriptions. Sites where the starting page pointed to many other pages were significantly less likely to reuse the same description on those other pages; where different descriptions were used, words from the starting page's description tended to appear toward the beginnings of other descriptions.
Compact graphic display of phrases from the original text was among abstracting assistance features being prototyped in the TEXNET text network management system. Compaction was achieved by embedding subphrases and by enabling the user to select rapidly word by word. Phrases displayed would not necessarily be those selected for automatic indexing. (Paper accepted for publication in 1995.)
Web pages registered with Yahoo! were analyzed for use of META tags and specifically the DESCRIPTION tag; about 57% contained META tags and 26% used the DESCRIPTION tag. Some of the descriptions greatly exceeded typical length guidelines of 150 or 200 characters. A minority exactly duplicated phrasing found in the visible text; most repeated some words and phrases. Noun phrases were slightly more common than complete sentences. Content usually related to responsible corporate bodies and their products and services; information about the page or site itself was included in about one third of descriptions.
Experimental subjects wrote abstracts of articles using a simplified version of the TEXNET abstracting assistance software. In addition to the full text, subjects were presented with either keywords or phrases extracted automatically. The resulting abstracts, and the times taken, were recorded automatically; some additional information was gathered by oral questionnaire. Selected abstracts produced were evaluated on various criteria by independent raters. Results showed considerable variation among subjects, but 37% found the keywords or phrases "quite" or "very" useful in writing their abstracts. Statistical analysis failed to support several hypothesized relations: phrases were not viewed as significantly more helpful than keywords; and abstracting experience did not correlate with originality of wording, approximation of the author abstract, or greater conciseness. Unanticipated strong correlations included Windows experience and writing an abstract like the author's; experience reading abstracts and thinking one had written a good abstract; gender and abstract length; gender and use of words and phrases from the original text. Results also suggested possible modifications to TEXNET.
A research assistant used the TEXNET abstracting assistance software to create abstracts to articles on the Web and also compiled introductory documentation, including a guide to abstracting using computer assistance tools. Problems encountered, tools selected for preferred use, and implications for future software development were considered.
FlipPhr was a 16-bit Windows program that flipped (rearranged) phrases or other expressions in accordance with rules in a grammar. Flipping could be invoked with a single keystroke from within various application programs that allowed cutting and pasting of text. The user could modify the grammar to provide for different kinds of flipping.
Abstracting assistance features were being prototyped in the TEXNET text network management system. Sentence weighting methods available included weighting negatively or positively on stems from a selected passage; weighting on general lists of cue words; adjusting weights of selected segments; and weighting on occurrences of frequent stems. Users could adjust a number of parameters: minimum length of extracts; threshold for a "frequent" word/stem; and amount to adjust a sentence weight for each weighting type.
Automatically generated displays might be an aid to abstractors. Some advantages of three-dimensional (3-d) displays were outlined, with experimental results confirming their reduced distortion of concept space. Features that might set 3-d representations of concept space off from other kinds of 3-d representations were noted. Features of a prototype system for VGA display were described and illustrated.
Comparison of 39 full texts of articles in the ONTAP Computer database with their abstracts showed (1) no significant relationship between abstract length and full-text length, (2) a small, but significant, tendency for abstract words and phrases to concentrate at the beginning of the full text, (3) little use in abstracts of longer verbatim word sequences from full texts. Additional results using the RightWriter style checker were also reported.
After an outline of desirable qualities for graphic representation of sentence dependency structures in texts more than a few sentences long, approaches prototyped in TEXNET were described, illustrated, and compared. Automatic structure simplification and automatic addition of dummy sentences were noted as useful.
Five automatic graph-drawing algorithms, all implemented as options in TEXNET, were evaluated using a sample of texts for which sentence dependency structures had been coded. Evaluation criteria included speed, number of crossing arcs, compactness, and suitability for application of selective scaling of the resulting display.
Different automatic abbreviation schemes for text in graphic displays of sentence dependency structure were assessed on a sample data set for compression and ambiguity. "Speedwriting" of words longer than 5 letters yielded a compression to 80% of the source text, with very low ambiguity. This and two other automatic notemaking-like techniques were implemented as options in TEXNET.
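The abstract does not specify the abbreviation rules used; the sketch below assumes a common speedwriting-like convention (keep the first letter of any word longer than 5 letters and drop its remaining vowels) purely for illustration, and makes no claim about reproducing the reported 80% compression figure.

```python
VOWELS = set("aeiouAEIOU")

def abbreviate(word):
    """Speedwriting-like abbreviation (illustrative stand-in, not the
    actual TEXNET rules): words of 5 letters or fewer are unchanged;
    longer words keep their first letter and drop subsequent vowels."""
    if len(word) <= 5:
        return word
    return word[0] + "".join(c for c in word[1:] if c not in VOWELS)

def compress(text):
    """Apply the abbreviation scheme word by word."""
    return " ".join(abbreviate(w) for w in text.split())
```

A scheme like this trades a small amount of ambiguity (distinct words can collapse to the same abbreviation) for shorter labels, which is the tradeoff the abstract reports measuring.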
A prototype of a fairly simple method for assisting human coders in recognizing sentence dependency structures in texts was developed in TEXNET. The aim was to employ texts plus structure information in automatically generating a variety of independently meaningful extracts on demand.
A method was prototyped in TEXNET for using Boolean queries in automatically deriving customized extracts from a text with semantic dependencies between sentences pre-coded. Each sentence in the structured text was treated as defining a separate extract, consisting of the sentence and all other sentences on which it was directly or indirectly dependent for its meaning. Extracts from a text that satisfied a given Boolean query were merged to eliminate duplicate sentences.
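The extract-derivation logic described above can be sketched as follows, assuming sentences numbered in text order and a `depends_on` table mapping each sentence to the sentences it directly depends on; the `matches` predicate stands in for Boolean query evaluation, whose details the abstract does not give.

```python
def closure(sent, depends_on):
    """The sentence plus all sentences on which it is directly or
    indirectly dependent for its meaning."""
    seen, stack = set(), [sent]
    while stack:
        s = stack.pop()
        if s not in seen:
            seen.add(s)
            stack.extend(depends_on.get(s, ()))
    return seen

def query_extract(sentences, depends_on, matches):
    """Merge the extracts defined by all sentences satisfying the query,
    eliminating duplicate sentences and preserving text order."""
    keep = set()
    for i, sentence in enumerate(sentences):
        if matches(sentence):
            keep |= closure(i, depends_on)
    return [sentences[i] for i in sorted(keep)]
```

Treating each sentence's extract as its dependency closure guarantees that every merged extract is independently meaningful: no retained sentence refers back to material that was cut.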
Nonformulaic abstracts with precoded anaphoras were analyzed for structures of semantic dependency between sentences. Of these, 30% contained at least one sentence that was dependent for its meaning on more than one other sentence. Automatic structural simplification, based on an assumption about the use of the structures, allowed all but 7% to be represented as trees. At least some branching was found in 67%, a number reduced to 40% by automatic simplification.
Methods of displaying networks of texts for online editing were discussed, and an early version of TEXNET was described.
The idea was explored of using a conventional string indexing source description, together with a special phrase generator, to generate multiple descriptor phrases for inclusion in a database record for online retrieval, including retrieval with proximity operators. Phrases should bring together groups of syntactically related words. Software using the BRACIS string indexing system was briefly described.
A method was outlined for obtaining independently meaningful abstracts from a single intermediate representation and preliminary results were considered. Implementation involved manual coding enclosing expressions in angular brackets with a number or letter code identifying the concept. Some control of sentence order was possible and appeared to be essential when some ordinary abstracts were coded.
In a new approach to generating string index entries from concept networks (implemented in NETPAD), terms from multi-term search specifications were cited near to the beginning of the entry, while an articulated entry structure indicative of concept relations was retained.
The BRACIS string indexing input coding scheme was designed to indicate some non-tree structures simply. It dealt with parallel, alternate, or coordinate sequences of terms and connectives by listing them all, one after another, separated by semicolons, and bracketing the list. The possibility of nesting made the scheme relatively powerful as well as relatively simple.
Storing and using thesaural information in NETPAD were described. In place of cross-references, indirect entries, the extent of which could be determined by weights specified by searchers, appeared in index displays.
An editor written in PET BASIC and 6502 machine language prevented syntactic errors in NEPHIS input and also provided continuous feedback in the form of initial parts of permutations that would result.
In NETPAD, originally developed for the PET2001-8, an enhanced triangular matrix display allowed easy editing of complex concept subnetworks, such as those using Farradane's relations. The underlying structure facilitated merging with a general concept network.
Simple automatic NEPHIS coding of most descriptive titles generated sufficiently good input strings. The algorithm was based on assumptions about three ranks of delimiter (involving only brief lists) and about phrase-element dependency in title language. NEPHIS facilitated human recoding of titles violating the assumptions.
A single structure for producing various index displays was proposed, with nodes for both concepts and concept links. Extracting subnetworks and structuring index displays were discussed. Details were given on implementation using DECsystem-10 COBOL with ISAM.
Titles of publications of the Canadian Department of Energy, Mines, and Resources were NEPHIS coded. A brief manual was developed, and problems and costs were analyzed. An experiment using simulated queries showed quicker retrieval using a NEPHIS index than KWOC.
A NEPHIS index simulator, written in PET BASIC, was described that generated hypothetical subject descriptions from a single user-supplied input string and used them to produce a simulated index display.
A program written in PET BASIC to show tree displays of NEPHIS input strings was briefly described.
A technique was described for decreasing average classification notation length without sacrificing expressiveness, succession of characteristics, or filing order. The salient node at which notation assignment began could be determined algorithmically, given data on collection bias, even with only part of the hierarchy determined. A dummy value indicated up-hierarchy movement. The technique was especially applicable to specialized collections and facets affected by anthropocentric bias.
Unlike NEPHIS, LIPHIS, also originally implemented in MACRO-10, handled more complex (non-tree) concept relation structures.
NEPHIS was a string indexing system designed to be easy for programmers, indexers, and index users. The original MACRO-10 version did not exceed a 1K memory assignment. FORTRAN code was provided in an appendix.
Subjects presented with full text of an article plus automatically extracted keywords or phrases wrote abstracts using a simplified version of TEXNET. Some additional information was gathered by oral questionnaire. Results showed considerable variation among subjects; 37% found the keywords or phrases "quite" or "very" useful. Phrases were not viewed as significantly more helpful than keywords; and abstracting experience did not correlate with originality of wording, approximation of the author abstract, or greater conciseness. Results also suggested possible modifications to TEXNET.
A thesaurus prototyped in TEXNET was intended to support production of a variety of printed thesaurus displays, as well as automatic weighting of passages and suggestion of alternate terms in abstracting.
An approach to automatically generating concept association maps that might be useful to abstractors used stems as concept surrogates, cooccurrence to define concept links, and a general graph-drawing algorithm (adapted from Watanabe) to position the stems in two dimensions. Some selection of links was necessary to avoid algorithm failure. Selection methods evaluated were (1) overall strongest links (Overall) and (2) strongest links for each stem (Per-stem). Collocating stems by degree of association was similar for both selection methods. The algorithm performed similarly to two-dimensional scaling. Overall was better than Per-stem at matching associations that would actually be used in abstracting, but produced more unconnected stems and more link crossings.
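The two link-selection methods compared above can be sketched as follows; the cooccurrence unit (here, whole passages) and the handling of ties are assumptions for illustration, not details from the study.

```python
from collections import Counter

def cooccurrence(passages):
    """Count, for each pair of stems, the passages in which both occur.
    Pair keys are stored in sorted order so each link has one canonical key."""
    counts = Counter()
    for stems in passages:
        uniq = sorted(set(stems))
        for i in range(len(uniq)):
            for j in range(i + 1, len(uniq)):
                counts[(uniq[i], uniq[j])] += 1
    return counts

def overall_strongest(counts, n):
    """Selection method (1), 'Overall': keep the n strongest links overall."""
    return {link for link, _ in counts.most_common(n)}

def per_stem_strongest(counts):
    """Selection method (2), 'Per-stem': keep, for each stem, the single
    strongest link in which it participates."""
    best = {}
    for link, c in counts.items():
        for stem in link:
            if stem not in best or c > best[stem][1]:
                best[stem] = (link, c)
    return {link for link, _ in best.values()}
```

The per-stem method guarantees every stem at least one link (no unconnected stems), while the overall method concentrates on the globally strongest associations, which matches the tradeoff the abstract reports.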
A special-purpose algorithm ("ring") performed better than a general graph-drawing algorithm (adapted from Watanabe) on time and number of crossing arcs.
Ideas on computerized graphic displays of concept networks were surveyed, with emphasis on syntactic and semantic relations and on indexing and information retrieval applications.
A method for producing customized extracts from multiple texts was prototyped in TEXNET. In response to a Boolean query, an initial set of sentences was retrieved using keywords. Depending on the option selected, a table of sentence dependencies could then be used to modify the set in one or more of the following ways: (1) keyword inheritance by dependent sentences; (2) pruning of sentence dependency structures to shorten the extract; (3) adding of sentences to provide context for understanding.
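Option (1), keyword inheritance, can be sketched as follows; the data representation (sentence numbers, keyword sets, a `depends_on` table mapping each sentence to the sentences it directly depends on) is an assumption for illustration.

```python
def inherit_keywords(keywords, depends_on):
    """Keyword inheritance by dependent sentences: each sentence's
    effective keyword set also includes the keywords of every sentence
    it depends on, directly or indirectly.  `keywords` maps sentence
    number -> set of keywords.  (Illustrative reading of option (1).)"""
    result = {}
    for s in keywords:
        effective = set(keywords[s])
        stack, seen = list(depends_on.get(s, ())), set()
        while stack:
            p = stack.pop()
            if p not in seen:
                seen.add(p)
                effective |= keywords.get(p, set())
                stack.extend(depends_on.get(p, ()))
        result[s] = effective
    return result
```

Under this reading, a query term found only in an antecedent sentence still retrieves the sentences that depend on it, enlarging the initial keyword-retrieved set before pruning or context addition is applied.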
Automatic structure simplification and automatic addition of dummy terms, especially when used together, could simplify graphic thesaurus displays without sacrificing important information and could also assist in correcting poorly structured thesauri. Originally developed for sentence dependency structures in TEXNET, they were subsequently introduced in the THSRS thesaurus management package.
A fairly simple method was implemented in TEXNET for recognizing sentence dependency structures automatically in journalistic prose. Errors could be corrected manually and were rendered less serious by a built-in bias in favor of recall over precision.
Indexers could see the index that they were producing as they worked at it; they could appear to make modifications directly on the index; complementary entries were created, merged, and formatted automatically; and source descriptions could be viewed in index item sequence. As in string indexing, the software generated multiple index entries from one source description. Users could input source descriptions directly; but the software could also reconstruct a source description unambiguously from any index entry.
Two features added to NETPAD were explained: templates; and "subschemas", which were like templates, but hid the explicit link structure and resembled database record structures.
Index display options introduced in NETPAD included control of citation order, heading-subheading boundaries, and connective wording and exclusion of selected aspects of descriptions, while the underlying source data remained constant.
Facets were considered as functions, either descriptive or prescriptive, of linktypes. Prescriptive functions might refer to a database in which the facets were themselves entities with special links to concepts.
In NETPAD, written in PET BASIC, an enhanced triangular matrix display allowed easy editing of complex concept subnetworks, such as those using Farradane's relations. The underlying structure facilitated merging with a general concept network. [A longer account appeared in Journal of Documentation, 38 (1): 29-37 (1982).]
A general concept network structure was developed for generating various index displays. The initial approach was to translate trees derived from the network into NEPHIS input strings, providing ease of implementation, compatibility with previous work, and applicability to cooperative indexing.
A NEPHIS index simulator, written in PET BASIC, was described that generated hypothetical subject descriptions from a single user-supplied input string and used them to produce a simulated index display. [A slightly expanded version of the paper appeared by invitation in International Classification, 7 (1): 21-24 (1980).]
A method of incorporating cross-references in a NEPHIS index was outlined that made use of a sequential-access thesaurus file of NEPHIS-coded records.
The NEPHIS string indexing system was introduced, and some results of an experiment in teaching it to paid subjects were reported.
Assignability of authors to Web pages using either normal browsing procedures or browsing assisted by simple automatic extraction was investigated. Candidate strings were extracted automatically from title elements, meta-tags, and address-like and copyright-like passages. An assistant attempted to identify personal authors by examining the pages themselves and related pages. Specific problems were noted and some refinements to the extraction methods suggested.
Current personal computer usefulness for research was briefly considered, with string indexing as an example.
Last updated January 23, 2008, by Tim Craven