A Probabilistic Framework for the Hierarchic Organisation and Classification of Document Collections

  • Authors:
  • Alexei Vinokourov; Mark Girolami

  • Affiliations:
  • Department of Computer Science, Royal Holloway, University of London, Egham, Surrey, TW20 0EX, UK. alexei@cs.rhul.ac.uk; School of Communication and Information Technologies, University of Paisley, High Street, Paisley, PA1 2BE, UK. mark.girolami@paisley.ac.uk

  • Venue:
  • Journal of Intelligent Information Systems
  • Year:
  • 2002

Abstract

This paper presents a probabilistic mixture modeling framework for the hierarchic organisation of document collections. The probabilistic corpus model that emerges from the unsupervised hierarchic organisation of a document collection can be exploited further to create a kernel that boosts the performance of state-of-the-art support vector machine document classifiers. Classifier performance improves further when the kernel is derived from an appropriate hierarchic mixture model of the corpus rather than from a flat, non-hierarchic mixture model. This has important implications for document classification when a hierarchic ordering of topics exists. The approach can be viewed as an effective combination of unlabeled documents (those with no topic or class labels), labeled documents, and prior domain knowledge (in the form of the known hierarchic structure) to enhance document classification performance.
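
The idea of deriving a kernel from a fitted mixture model can be illustrated with a small sketch. The abstract does not specify the kernel construction, so the code below is not the authors' method: it assumes a flat multinomial mixture with already-fitted parameters and uses one simple, well-known choice, the inner product of component posteriors, K(d, d') = Σ_k P(k | d) P(k | d'). The function names, the toy parameters, and the toy documents are all hypothetical.

```python
import numpy as np


def posterior_responsibilities(counts, log_prior, log_word_probs):
    """Return P(component | document) under a multinomial mixture.

    counts:         (n_docs, n_words) word-count matrix
    log_prior:      (n_components,) log mixing proportions
    log_word_probs: (n_components, n_words) log word probabilities
    """
    # Log joint of each document with each component, up to a constant
    # (the multinomial coefficient cancels when normalising).
    log_joint = counts @ log_word_probs.T + log_prior
    # Subtract the row maximum before exponentiating for numerical stability.
    log_joint = log_joint - log_joint.max(axis=1, keepdims=True)
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


def mixture_kernel(counts_a, counts_b, log_prior, log_word_probs):
    """Kernel matrix K[i, j] = sum_k P(k | a_i) * P(k | b_j)."""
    post_a = posterior_responsibilities(counts_a, log_prior, log_word_probs)
    post_b = posterior_responsibilities(counts_b, log_prior, log_word_probs)
    return post_a @ post_b.T


# Hypothetical, hand-set parameters for 2 components over a 3-word vocabulary.
# In practice these would be estimated from the unlabeled corpus (e.g. by EM).
toy_log_prior = np.log(np.array([0.5, 0.5]))
toy_log_word_probs = np.log(np.array([[0.8, 0.1, 0.1],
                                      [0.1, 0.1, 0.8]]))

# Three toy documents as word-count vectors.
toy_docs = np.array([[5, 1, 0],
                     [0, 1, 5],
                     [4, 0, 1]])

K = mixture_kernel(toy_docs, toy_docs, toy_log_prior, toy_log_word_probs)
```

Because K is a Gram matrix of posterior vectors, it is symmetric and positive semi-definite, so it could be passed to a support vector machine that accepts a precomputed kernel (for example scikit-learn's `SVC(kernel="precomputed")`). A hierarchic variant might, under the same assumptions, combine such posteriors across the levels of the hierarchy; that extension is not shown here.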