Information Retrieval Based upon Latent Class Analysis

  • Authors:
  • Frank B. Baker

  • Affiliations:
  • University of Wisconsin, Laboratory of Experimental Design, Madison, Wisconsin

  • Venue:
  • Journal of the ACM (JACM)
  • Year:
  • 1962

Abstract

The application of digital computers to the tasks of document classification, storage and retrieval holds considerable promise for solving the so-called “library problem.” Owing to the high speed and data-handling characteristics of digital computers, a number of different approaches to the “library problem” have been placed in operation [4]. Although existing systems are rather rudimentary when compared with the ultimate goal of an automated library, progress towards that goal has been made in several areas, notably the organization of a mass of documents through automatic indexing schemes, and the retrieval from a mass of documents of only those documents related to an information request made by a user of the library.

A high proportion of existing document retrieval systems is based upon the author's background and skill rather than upon a mathematical model. Although allowing considerable success in the initial stages of development, the heuristic approach has a limited potential unless an underlying mathematical rationale can be found. Therefore, the present paper proposes an information retrieval system based upon Lazarsfeld's latent class analysis [11], which has mathematical foundations. Although latent class analysis was developed by Lazarsfeld [11] to analyze questionnaires, the similarity between this task and document classification suggests that the mathematical rationale for the former could also provide a useful theoretical basis for the latter.

Because the number of words contained in even a moderately sized report can exceed the capacity of most computers, some form of data reduction is a necessity. The reduction usually results in one of three levels of abstraction: abstracts of documents, key or topical words which convey the meaning of the document or abstract, and indices or tags based upon key words which are then assigned to the document.
In general, indexing systems either assign key words to the document or use several key words to assign tags or indices to the documents. The key words or tags then serve as basic information for a retrieval system. Until a radical change in the data-handling characteristics of computers is made, it would appear likely that key words or tags will continue to serve as the raw data for information retrieval systems. Although considerable uniformity exists in the basic data introduced into an automated library, many different approaches exist to the subsequent processing of the data. Several papers that illustrate some of the considerations entering into the development of an information retrieval system are reviewed below.

Maron and Kuhns [8] have developed the “probabilistic indexing” scheme, which reduces the number of documents searched yet increases the retrieval of appropriate documents. In this approach, a large mass of source documents was read by human reviewers and key words were selected. The key words were then pooled into a few well-defined categories. However, any given key word could appear in more than one category. The resulting categories were then assigned meaningful labels or tags which constituted an index term list. The source documents were then re-inspected and the appropriate tag or tags assigned to the document. Document retrieval using the probabilistic indexing scheme is accomplished by presenting the computer with a series of tags and a lower bound on the relevance number, below which documents are not of sufficient importance to be retrieved. The tags locate the document, and the value of the corresponding relevance number compared to the lower bound determines whether the document should be retrieved.

The high degree of dependence of the probabilistic indexing scheme upon human reviewers greatly reduces the efficiency of the method.
If the number of documents, key words and tags were large, a human reviewer would not be able to maintain a consistent frame of reference when assigning tags and relevance numbers. The unique contribution of the probabilistic indexing scheme, however, is the use of relevance numbers in conjunction with the indices. The number provides a basis for determining the relevance of the stored documents to the index terms used by the requester of information.

Stiles [10] has also reported the use of an association factor to accompany the index terms assigned to a document. The factor used expressed the discrepancy of the observed joint occurrence from the expected joint occurrence of an index pair, assuming independence. The association factor employed was the χ² value obtained from a two-fold contingency table involving the pair of index terms. A correlation coefficient, such as the tetrachoric r, which expresses the correlation within the two-fold table, would have been more appropriate in the present context than a chi-square value expressing lack of independence. Stiles [10], however, reports that the use of the association factor was found to improve document retrieval.

A more intensive study of the inter-relationships among words within a document was performed by Doyle [2]. The joint occurrences of word pairs in a body of 600 documents served as the basic data of the study. Two types of word correlations were found to exist within word pairs: adjacent correlations, resulting from words which appeared in pairs due to the nature of our language; and proximal correlations, due to words which are logically related but appear at non-adjacent positions within a document. The statistical effects of these two correlations were denoted by language redundancy and reality redundancy. In addition, a third type of redundancy, documentation redundancy, resulted when more than one document could be classified by a given set of key words.
The effect of language redundancy can be reduced by pooling adjacent key words and treating the pair as a single key word, thus eliminating the redundancy. Documentation redundancy would be reduced by pooling similar documents and assigning a single label to the batch, thus eliminating unnecessary duplication of effort. Reality redundancy, however, is the result of the author's cognitive processes, and the degree to which the literature researcher can duplicate this redundancy determines how successfully the original document can be retrieved. This study indicates that an important function in an information retrieval system is machinery for reducing the effects of language and documentation redundancy so that important relationships are not obscured.

The results of the three studies reviewed above indicate that document retrieval can be improved if the documents are surveyed for documentation redundancy and if the relationships among the key words are filtered to remove language redundancy. In addition, the use of a relevance number relating the document and key words appears to increase the efficiency of document retrieval.
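
Two of the quantities discussed in the review above can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the procedure of any of the cited authors. The first function computes a plain Pearson chi-square for the two-fold table of an index-term pair (whether Stiles applied a correction or further transformation is not stated here). The second retrieves documents whose relevance number for a requested tag meets a lower bound, in the spirit of the probabilistic indexing scheme. The function names, the dictionary layout of the index, and the rule of taking the largest relevance number over the requested tags are all hypothetical choices made for the example.

```python
def chi_square_2x2(both: int, only_a: int, only_b: int, neither: int) -> float:
    """Pearson chi-square (no continuity correction) for a two-fold table.

    The four counts are numbers of documents indexed by both terms,
    by term A only, by term B only, and by neither term.
    """
    n = both + only_a + only_b + neither
    with_a = both + only_a
    without_a = only_b + neither
    with_b = both + only_b
    without_b = only_a + neither
    denom = with_a * without_a * with_b * without_b
    if denom == 0:
        # A term that occurs in every document or in none carries no
        # association information; report zero rather than divide by zero.
        return 0.0
    return n * (both * neither - only_a * only_b) ** 2 / denom


def retrieve(index: dict, request_tags: list, lower_bound: float) -> list:
    """Return (doc_id, relevance) pairs meeting the lower bound.

    `index` maps a document id to a dict of tag -> relevance number.
    Assumption for this sketch: a document is located if it carries any
    requested tag, and it is scored by its largest matching relevance number.
    """
    hits = []
    for doc_id, tags in index.items():
        scores = [tags[t] for t in request_tags if t in tags]
        if scores and max(scores) >= lower_bound:
            hits.append((doc_id, max(scores)))
    # Most relevant documents first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical counts for one pair of index terms over 100 documents.
    print(chi_square_2x2(both=30, only_a=10, only_b=10, neither=50))

    # Hypothetical tagged collection with relevance numbers in [0, 1].
    sample_index = {
        "d1": {"retrieval": 0.9, "indexing": 0.4},
        "d2": {"retrieval": 0.3},
        "d3": {"statistics": 0.8},
    }
    print(retrieve(sample_index, ["retrieval"], 0.5))
```

A chi-square value of this kind grows with the number of documents as well as with the strength of association, which is one reason a correlation coefficient computed from the same table may be easier to compare across term pairs.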