Information Retrieval Based upon Latent Class Analysis

Authors:
Frank B. Baker
Affiliations:
University of Wisconsin, Laboratory of Experimental Design, Madison, Wisconsin
Venue:
Journal of the ACM (JACM)
Year:
1962

Citing 5

On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the ACM (JACM)
The Association Factor in Information Retrieval. Journal of the ACM (JACM)
Automatic Indexing: An Experimental Inquiry. Journal of the ACM (JACM)
Semantic Road Maps for Literature Searchers. Journal of the ACM (JACM)
A survey of languages and systems for information retrieval. Communications of the ACM

Abstract

The application of digital computers to the tasks of document classification, storage and retrieval holds considerable promise for solving the so-called “library problem.” Owing to the high speed and data-handling characteristics of digital computers, a number of different approaches to the “library problem” have been placed in operation [4]. Although existing systems are rather rudimentary when compared with the ultimate goal of an automated library, progress towards that goal has been made in several areas: the organization of a mass of documents through automatic indexing schemes, and the retrieval from a mass of documents of only those documents related to an information request made by a user of the library.

A high proportion of existing document retrieval systems is based upon the author's background and skill rather than upon a mathematical model. Although it has allowed considerable success in the initial stages of development, the heuristic approach has limited potential unless an underlying mathematical rationale can be found. Therefore, the present paper proposes an information retrieval system based upon Lazarsfeld's latent class analysis [11], which has mathematical foundations. Although latent class analysis was developed by Lazarsfeld [11] to analyze questionnaires, the similarity between this task and document classification suggests that the mathematical rationale for the former could also provide a useful theoretical basis for the latter.

Because the number of words contained in even a moderately sized report can exceed the capacity of most computers, some form of data reduction is a necessity. The reduction usually results in one of three levels of abstraction: abstracts of documents; key or topical words which convey the meaning of the document or abstract; and indices or tags based upon key words, which are then assigned to the document.
In general, indexing systems either assign key words to the document or use several key words to assign tags or indices to the documents. The key words or tags then serve as basic information for a retrieval system. Until a radical change in the data-handling characteristics of computers is made, it would appear likely that key words or tags will continue to serve as the raw data for information retrieval systems. Although considerable uniformity exists in the basic data introduced into an automated library, many different approaches exist as to the subsequent processing of the data. Several papers that illustrate some of the considerations entering into the development of an information retrieval system are reviewed below.

Maron and Kuhns [8] have developed the “probabilistic indexing” scheme, which reduces the number of documents searched yet increases the retrieval of appropriate documents. In this approach, a large mass of source documents was read by human reviewers and key words were selected. The key words were then pooled into a few well-defined categories. However, any given key word could appear in more than one category. The resulting categories were then assigned meaningful labels or tags, which constituted an index term list. The source documents were then re-inspected and the appropriate tag or tags assigned to each document. Document retrieval using the probabilistic indexing scheme is accomplished by presenting the computer with a series of tags and a lower bound on the relevance number, below which documents are not of sufficient importance to be retrieved. The tags locate the document, and comparing the corresponding relevance number with the lower bound determines whether the document should be retrieved.

The high degree of dependence of the probabilistic indexing scheme upon human reviewers greatly reduces the efficiency of the method.
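The retrieval step of the probabilistic indexing scheme described above can be illustrated with a minimal Python sketch. The index contents, the name `retrieve`, and the rule of keeping a document's highest relevance number when several request tags reach it are assumptions made for illustration; the text does not say how Maron and Kuhns [8] combine several tags.

```python
from typing import Dict, List, Tuple

# Hypothetical index: tag -> {document id: relevance number}.
# Tags, document ids, and numbers are invented for illustration only.
INDEX: Dict[str, Dict[str, float]] = {
    "retrieval": {"doc1": 0.9, "doc2": 0.4, "doc3": 0.7},
    "indexing": {"doc2": 0.8, "doc3": 0.2},
}


def retrieve(
    index: Dict[str, Dict[str, float]],
    tags: List[str],
    lower_bound: float,
) -> List[Tuple[str, float]]:
    """Return (document, relevance) pairs that meet the lower bound.

    The tags locate candidate documents; a candidate is kept only if its
    relevance number for some request tag is at least `lower_bound`.
    Assumption: when several tags qualify, the largest number is kept.
    """
    best: Dict[str, float] = {}
    for tag in tags:
        for doc, relevance in index.get(tag, {}).items():
            if relevance >= lower_bound:
                best[doc] = max(best.get(doc, 0.0), relevance)
    # Most relevant first; ties are broken by document id.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for doc, relevance in retrieve(INDEX, ["retrieval", "indexing"], 0.5):
        print(doc, relevance)
```

Raising `lower_bound` shrinks the retrieved set, which is the trade-off between the number of documents examined and the importance of those returned that the scheme exposes to the requester.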
If the number of documents, key words and tags were large, a human reviewer would not be able to maintain a consistent frame of reference when assigning tags and relevance numbers. The unique contribution of the probabilistic indexing scheme, however, is the use of relevance numbers in conjunction with the indices. The number provides a basis for determining the relevance of the stored documents to the index terms used by the requester of information.

Stiles [10] has also reported the use of an association factor to accompany the index terms assigned to a document. The factor expressed the discrepancy between the observed joint occurrence and the expected joint occurrence of an index pair, assuming independence. The association factor employed was the χ² value obtained from a two-fold contingency table involving the pair of index terms. A correlation coefficient, such as the tetrachoric r, which expresses the correlation within the two-fold table, would have been more appropriate in the present context than a chi-square value expressing lack of independence. Stiles [10], however, reports that the use of the association factor was found to improve document retrieval.

A more intensive study of the inter-relationships among words within a document was performed by Doyle [2]. The joint occurrences of word pairs in a body of 600 documents served as the basic data of the study. Two types of word correlations were found to exist within word pairs: adjacent correlations, resulting from words which appear in pairs due to the nature of our language; and proximal correlations, due to words which are logically related but appear at non-adjacent positions within a document. The statistical effects of these two correlations were termed language redundancy and reality redundancy, respectively. In addition, a third type of redundancy, documentation redundancy, resulted when more than one document could be classified by a given set of key words.
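The association factor attributed to Stiles [10] above, a χ² value from the two-fold table for a pair of index terms, can be illustrated with the following Python sketch. It uses the standard Pearson chi-square for a 2 × 2 table without a continuity correction; whether Stiles applied a correction or a further transformation is not stated in this text, and the function names and counts are hypothetical.

```python
def expected_joint(both: int, only_a: int, only_b: int, neither: int) -> float:
    """Expected number of documents carrying both terms under independence."""
    n = both + only_a + only_b + neither
    return (both + only_a) * (both + only_b) / n


def chi_square_2x2(both: int, only_a: int, only_b: int, neither: int) -> float:
    """Pearson chi-square for the 2 x 2 table of two index terms.

    both    -- documents indexed by term A and term B
    only_a  -- documents indexed by A but not B
    only_b  -- documents indexed by B but not A
    neither -- documents indexed by neither term
    """
    n = both + only_a + only_b + neither
    margins = (
        (both + only_a)        # documents with A
        * (only_b + neither)   # documents without A
        * (both + only_b)      # documents with B
        * (only_a + neither)   # documents without B
    )
    if margins == 0:
        raise ValueError("a marginal total is zero; chi-square is undefined")
    return n * (both * neither - only_a * only_b) ** 2 / margins


if __name__ == "__main__":
    # Invented counts: 100 documents, 20 of which carry both terms.
    print(expected_joint(20, 10, 10, 60))
    print(chi_square_2x2(20, 10, 10, 60))
```

A value of zero means the observed joint occurrence equals the expected one. Note that the statistic does not distinguish positive from negative association, which is one reason a correlation coefficient computed from the same table may be preferred.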
The effect of language redundancy can be reduced by pooling adjacent key words and treating the pair as a single key word, thus eliminating the redundancy. Documentation redundancy would be reduced by pooling similar documents and assigning a single label to the batch, thus eliminating unnecessary duplication of effort. Reality redundancy, however, is the result of the author's cognitive processes, and the degree to which the literature researcher can duplicate this redundancy determines how successfully the original document can be retrieved. This study indicates that an important function in an information retrieval system is machinery for reducing the effects of language and documentation redundancy so that important relationships are not obscured.

The results of the three studies reviewed above indicate that document retrieval can be improved if the documents are surveyed for documentation redundancy and if the relationships among the key words are filtered to remove language redundancy. In addition, the use of a relevance number relating the document and key words appears to increase the efficiency of document retrieval.
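The pooling of adjacent key words, described above as a way to reduce language redundancy, can be illustrated with a short Python sketch. The function name `pool_adjacent`, the underscore used to join a pooled pair, and the hand-supplied set of pairs are all assumptions; the text does not say how the adjacent pairs to be pooled are selected.

```python
from typing import List, Set, Tuple


def pool_adjacent(keywords: List[str], pairs: Set[Tuple[str, str]]) -> List[str]:
    """Merge each adjacent key-word pair found in `pairs` into one key word.

    `pairs` is a hypothetical, hand-supplied set of (first, second) word
    pairs judged to co-occur only because of the nature of the language.
    """
    pooled: List[str] = []
    i = 0
    while i < len(keywords):
        if i + 1 < len(keywords) and (keywords[i], keywords[i + 1]) in pairs:
            # Treat the adjacent pair as a single key word.
            pooled.append(keywords[i] + "_" + keywords[i + 1])
            i += 2
        else:
            pooled.append(keywords[i])
            i += 1
    return pooled


if __name__ == "__main__":
    words = ["digital", "computer", "information", "retrieval", "system"]
    known_pairs = {("digital", "computer"), ("information", "retrieval")}
    print(pool_adjacent(words, known_pairs))
```

After pooling, co-occurrence counts are taken over the merged key words, so a pair that appears together only for linguistic reasons no longer registers as a correlation between two separate key words.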