The Organisation and Visualisation of Document Corpora: A Probabilistic Approach

  • Authors:
  • M. Girolami; A. Vinokourov; A. Kaban

  • Venue:
  • DEXA '00 Proceedings of the 11th International Workshop on Database and Expert Systems Applications
  • Year:
  • 2000

Abstract

A generic probabilistic framework for the unsupervised organisation and visualisation of document collections is presented. Probabilistic hierarchical clustering of large-scale, sparse, high-dimensional data collections is achieved by developing a family of latent class models whose parameters are estimated using the expectation maximisation (EM) algorithm. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis, termed symmetric and asymmetric models. For text data specifically, asymmetric and symmetric models based on the multinomial and binomial distributions are the most appropriate. The subsequent visualisation of document collections is achieved by exploiting the topographic relations between similar documents. A latent trait model is developed that allows vector space document representations to be viewed on a 2D grid, thereby visualising the inherent structure of the document collection. A number of experiments demonstrate the technique, and a concluding discussion of the proposed models is given.
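
As a rough illustration of how a latent class model over term counts can be fitted with expectation maximisation, the sketch below runs EM on a flat (non-hierarchical) mixture of multinomials. It is not the authors' model: the hierarchy, the symmetric/asymmetric distinction, and the latent trait visualisation are all left out. The function name `fit_multinomial_mixture`, the smoothing constant `alpha`, and the toy corpus are assumptions made purely for illustration.

```python
import numpy as np

def fit_multinomial_mixture(counts, n_classes, n_iter=50, alpha=1e-2, seed=0):
    """EM for a flat mixture of multinomials over a document-term count matrix.

    Hypothetical helper for illustration only; `alpha` is an assumed
    smoothing constant that keeps all probabilities strictly positive.
    """
    rng = np.random.default_rng(seed)
    n_docs, _ = counts.shape
    # Start from random soft assignments of documents to latent classes.
    resp = rng.dirichlet(np.ones(n_classes), size=n_docs)
    for _ in range(n_iter):
        # M-step: re-estimate class priors and per-class term distributions.
        priors = (resp.sum(axis=0) + alpha) / (n_docs + n_classes * alpha)
        term_probs = resp.T @ counts + alpha
        term_probs /= term_probs.sum(axis=1, keepdims=True)
        # E-step: posterior class responsibilities, computed in log space.
        # The multinomial coefficient is omitted because it is the same
        # for every class and cancels in the normalisation.
        log_joint = np.log(priors) + counts @ np.log(term_probs).T
        log_norm = np.logaddexp.reduce(log_joint, axis=1, keepdims=True)
        resp = np.exp(log_joint - log_norm)
    return priors, term_probs, resp

# Toy corpus (made up): two documents use terms 0-1, two use terms 2-3.
toy_counts = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
])
priors, term_probs, resp = fit_multinomial_mixture(toy_counts, n_classes=2)
labels = resp.argmax(axis=1)  # hard class label per document
```

On the toy corpus, the two documents that share vocabulary should end up in the same latent class. Which class index each group receives depends on the random initialisation.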