Document Clustering and Cluster Topic Extraction in Multilingual Corpora

  • Authors:
  • Joaquim Ferreira da Silva;João Mexia;Carlos Agra Coelho;José Gabriel Pereira Lopes

  • Venue:
  • ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
  • Year:
  • 2001

Abstract

A statistics-based approach for clustering documents and for extracting cluster topics is described. Relevant (meaningful) Expressions (REs) automatically extracted from corpora are used as clustering base features. These features are transformed and their number is strongly reduced in order to obtain a small set of document classification features. This is achieved on the basis of Principal Components Analysis. Model-Based Clustering Analysis finds the best number of clusters. Then, the most important REs are extracted from each cluster and taken as document cluster topics.
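
The pipeline outlined in the abstract (feature reduction by Principal Components Analysis, then model-based clustering that chooses the number of clusters) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes scikit-learn's PCA and GaussianMixture, uses the Bayesian Information Criterion (BIC) as the model-selection rule, and runs on synthetic data standing in for RE-based document features. The names cluster_documents, top_features and re_0 ... re_19 are hypothetical, and the topic score (cluster mean minus overall mean) is an illustrative stand-in, since the abstract does not say how the importance of an RE is measured.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def cluster_documents(features, n_components=2, max_clusters=6, seed=0):
    """Reduce features with PCA, then choose the cluster count by lowest BIC."""
    reduced = PCA(n_components=n_components).fit_transform(features)
    best_model, best_bic = None, np.inf
    for k in range(1, max_clusters + 1):
        model = GaussianMixture(n_components=k, n_init=3, random_state=seed)
        model.fit(reduced)
        bic = model.bic(reduced)  # lower BIC means a better fit/complexity trade-off
        if bic < best_bic:
            best_model, best_bic = model, bic
    return reduced, best_model.n_components, best_model.predict(reduced)


def top_features(features, labels, names, n_top=3):
    """Assumed scoring: rank features by cluster mean minus overall mean."""
    overall = features.mean(axis=0)
    topics = {}
    for c in np.unique(labels):
        score = features[labels == c].mean(axis=0) - overall
        ranked = np.argsort(score)[::-1][:n_top]
        topics[int(c)] = [names[i] for i in ranked]
    return topics


# Synthetic stand-in for a document-by-RE feature matrix:
# 3 well-separated groups of 40 "documents" over 20 "REs".
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 20))
features = np.vstack([c + rng.normal(size=(40, 20)) for c in centers])
names = [f"re_{i}" for i in range(20)]  # hypothetical RE identifiers

reduced, best_k, labels = cluster_documents(features)
topics = top_features(features, labels, names)
print("chosen number of clusters:", best_k)
print("top features per cluster:", topics)
```

Choosing the number of clusters by comparing fitted mixture models is what distinguishes this from methods such as k-means, where the cluster count has to be supplied in advance.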