Representation and Learning in Information Retrieval

  • Authors:
  • David D Lewis

  • Venue:
  • Representation and Learning in Information Retrieval
  • Year:
  • 1991

Abstract

This dissertation introduces a new theoretical model for text classification systems, including systems for document retrieval, automated indexing, electronic mail filtering, and similar tasks. The Concept Learning model emphasizes the role of manual and automated feature selection and classifier formation in text classification. It enables drawing on results from statistics and machine learning in explaining the effectiveness of alternate representations of text, and specifies desirable characteristics of text representations.
The use of syntactic parsing to produce indexing phrases has been widely investigated as a possible route to better text representations. Experiments with syntactic phrase indexing, however, have never yielded significant improvements in text retrieval performance. The Concept Learning model suggests that the poor statistical characteristics of a syntactic indexing phrase representation negate its desirable semantic characteristics. The application of term clustering to this representation, to improve its statistical properties while retaining its desirable meaning properties, is proposed.
Standard term clustering strategies from information retrieval (IR), based on co-occurrence of indexing terms in documents or groups of documents, were tested on a syntactic indexing phrase representation. In experiments using a standard text retrieval test collection, small effectiveness improvements were obtained. As a means of evaluating representation quality, however, a text retrieval test collection introduces a number of confounding factors. In contrast, the text categorization task allows much cleaner determination of text representation properties. In preparation for the use of text categorization to study text representation, a more effective and theoretically well-founded probabilistic text categorization algorithm was developed, building on work by Maron, Fuhr, and others.
Text categorization experiments supported a number of predictions of the Concept Learning model about properties of phrasal representations, including dimensionality properties not previously measured for text representations. However, in carefully controlled experiments using syntactic phrases produced by Church's stochastic bracketer, in conjunction with reciprocal nearest neighbor clustering, term clustering was found to produce essentially no improvement in the properties of the phrasal representation. New cluster analysis approaches are proposed to remedy the problems found in traditional term clustering methods.
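
For readers unfamiliar with reciprocal nearest neighbor clustering, the general idea can be sketched as follows: two terms are grouped only when each is the other's most similar term. The abstract does not specify the similarity measure or data layout used in the dissertation, so the cosine similarity over document-occurrence vectors, the function names, and the toy phrases below are all assumptions made for illustration, not the author's procedure.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def reciprocal_nearest_neighbors(term_vectors):
    """Return sorted pairs of terms that are each other's nearest neighbor.

    term_vectors maps a term to its document-occurrence vector
    (an assumed representation; entry i counts occurrences in document i).
    """
    terms = sorted(term_vectors)
    nearest = {}
    for term in terms:
        best, best_sim = None, 0.0
        for other in terms:
            if other == term:
                continue
            sim = cosine(term_vectors[term], term_vectors[other])
            # Strict comparison: ties go to the alphabetically first term,
            # and a term with no positive similarity has no neighbor.
            if sim > best_sim:
                best, best_sim = other, sim
        nearest[term] = best

    pairs = set()
    for term, neighbor in nearest.items():
        if neighbor is not None and nearest[neighbor] == term:
            pairs.add(tuple(sorted((term, neighbor))))
    return sorted(pairs)


# Hypothetical phrases with occurrence counts over four toy documents.
term_vectors = {
    "interest rate": [1, 1, 0, 0],
    "rate of interest": [1, 1, 0, 0],
    "oil price": [0, 0, 1, 1],
    "price of oil": [0, 0, 1, 0],
    "wheat export": [0, 1, 0, 0],
}

clusters = reciprocal_nearest_neighbors(term_vectors)
```

In this toy data, "wheat export" is closest to "interest rate", but that preference is not returned, so it stays unclustered. Clusters formed this way contain at most two terms, which keeps them tight but also limits how much the statistical properties of a sparse phrasal representation can change.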