Exploiting Hierarchy in Text Categorization

  • Authors:
  • Andreas S. Weigend;Erik D. Wiener;Jan O. Pedersen

  • Affiliations:
  • Department of Information Systems, Leonard N. Stern School of Business, New York University, 44 West Fourth Street, New York, NY 10012, USA. andreas@weigend.com www.weigend.com;-;InfoSeek Corp., 1399 Moffet Park Drive, Sunnyvale, CA 94089, USA

  • Venue:
  • Information Retrieval
  • Year:
  • 1999

Abstract

With the recent dramatic increase in electronic access to documents, text categorization, the task of assigning topics to a given document, has moved to the center of the information sciences and knowledge management. This article uses the structure that is present in the semantic space of topics in order to improve performance in text categorization: according to their meaning, topics can be grouped together into “meta-topics”, e.g., gold, silver, and copper are all metals. The proposed architecture matches the hierarchical structure of the topic space, as opposed to a flat model that ignores the structure. It accommodates both single and multiple topic assignments for each document. Its probabilistic interpretation allows its predictions to be combined in a principled way with information from other sources. The first level of the architecture predicts the probabilities of the meta-topic groups. This allows the individual models for each topic on the second level to focus on finer discriminations within the group. Evaluating the performance of a two-level implementation on the Reuters-22173 testbed of newswire articles shows the most significant improvement for rare classes.
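
The two-level idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how the two levels are combined, so the product rule below, P(topic | doc) = P(meta-topic | doc) × P(topic | meta-topic, doc), is an assumption chosen as one natural probabilistic reading. The `HIERARCHY` table, the "grains" group, the function name `combine_levels`, and all probability values are hypothetical and serve only as illustration.

```python
from typing import Dict, List

# Hypothetical hierarchy. "metals" follows the abstract's example;
# "grains" and its topics are invented for illustration only.
HIERARCHY: Dict[str, List[str]] = {
    "metals": ["gold", "silver", "copper"],
    "grains": ["wheat", "corn"],
}


def combine_levels(
    meta_probs: Dict[str, float],
    within_probs: Dict[str, float],
    hierarchy: Dict[str, List[str]] = HIERARCHY,
) -> Dict[str, float]:
    """Combine first-level and second-level outputs for one document.

    meta_probs[m]   : assumed first-level output, P(meta-topic m | doc)
    within_probs[t] : assumed second-level output, P(topic t | its meta-topic, doc)

    Assumption (not taken from the source): the levels combine as a product.
    """
    topic_probs: Dict[str, float] = {}
    for meta, topics in hierarchy.items():
        for topic in topics:
            topic_probs[topic] = meta_probs[meta] * within_probs[topic]
    return topic_probs


# Made-up outputs for a single document. The within-group values need not
# sum to one, which leaves room for multiple topic assignments.
meta_probs = {"metals": 0.9, "grains": 0.1}
within_probs = {
    "gold": 0.8, "silver": 0.3, "copper": 0.05,
    "wheat": 0.6, "corn": 0.5,
}
topic_probs = combine_levels(meta_probs, within_probs)
```

Under this assumed rule, a topic's probability can never exceed that of its meta-topic, so each second-level model only has to discriminate among topics inside its own group.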