Multilingual Document Clustering, Topic Extraction and Data Transformations

Authors:
Joaquim Ferreira da Silva;João Mexia;Carlos Agra Coelho;José Gabriel Pereira Lopes
Venue:
EPIA '01 Proceedings of the 10th Portuguese Conference on Artificial Intelligence on Progress in Artificial Intelligence, Knowledge Extraction, Multi-agent Systems, Logic Programming and Constraint Solving
Year:
2001


Abstract

This paper describes a statistics-based approach to clustering documents and extracting cluster topics. Relevant Expressions (REs) are extracted from corpora and used as the base features for clustering. These features are transformed and then reduced, by an approach based on Principal Components Analysis, to a small set of document classification features. The best number of clusters is determined by Model-Based Clustering Analysis. Data transformations are applied so that the data approximate a normal distribution, and the results are discussed. The most important REs in each cluster are extracted and taken as the cluster's topics.
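
The abstract does not say which transformation is used to approximate a normal distribution. As a minimal sketch only, the snippet below assumes a log(1 + x) transformation and synthetic, right-skewed feature values (neither comes from the paper), and compares sample skewness before and after the transformation. The helper names `skewness` and `log_transform` are hypothetical.

```python
import math
import random
import statistics


def skewness(values):
    """Third standardized moment (population form) of a list of numbers."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


def log_transform(values):
    """Assumed transformation: log(1 + x) for non-negative, right-skewed data."""
    return [math.log1p(v) for v in values]


# Synthetic right-skewed values, for illustration only (not data from the paper).
rng = random.Random(0)
raw = [rng.lognormvariate(0.0, 1.0) for _ in range(2000)]
transformed = log_transform(raw)

skew_raw = skewness(raw)
skew_transformed = skewness(transformed)

# A skewness closer to zero indicates a more symmetric, more normal-like shape.
print(f"skewness before: {skew_raw:.2f}, after: {skew_transformed:.2f}")
```

A skewness nearer zero after the transformation suggests the values are better suited to a clustering model that assumes Gaussian components. Whether this particular transformation matches the one used in the paper is not stated in the abstract.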