Scatter/Gather: a cluster-based approach to browsing large document collections
SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
A Classification EM algorithm for clustering and two stochastic versions
Computational Statistics & Data Analysis - Special issue on optimization techniques in statistics
Probabilistic latent semantic indexing
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Cluster-based language models for distributed retrieval
Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval
Concept decompositions for large sparse text data using clustering
Machine Learning
Random projection in dimensionality reduction: applications to image and text data
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Co-clustering documents and words using bipartite spectral graph partitioning
Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
The use of unlabeled data to improve supervised learning for text summarization
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Unsupervised document classification using sequential information maximization
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Introduction to Modern Information Retrieval
Introduction to Modern Information Retrieval
Document clustering based on non-negative matrix factorization
Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Cluster ensembles --- a knowledge reuse framework for combining multiple partitions
The Journal of Machine Learning Research
The Journal of Machine Learning Research
Proceedings of the 13th international conference on World Wide Web
Document preprocessing for naive Bayes classification and clustering with mixture of multinomials
Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining
Relation between PLSA and NMF and implications
Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval
A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation
The Journal of Machine Learning Research
Hi-index | 0.00 |
Most document clustering algorithms operate in a high dimensional bag-of-words space. The inherent presence of noise in such representation obviously degrades the performance of most of these approaches. In this paper we investigate an unsupervised dimensionality reduction technique for document clustering. This technique is based upon the assumption that terms co-occurring in the same context with the same frequencies are semantically related. On the basis of this assumption we first find term clusters using a classification version of the EM algorithm. Documents are then represented in the space of these term clusters and a multinomial mixture model (MM) is used to build document clusters. We empirically show on four document collections, Reuters-21578, Reuters RCV2-French, 20Newsgroups and WebKB, that this new text representation noticeably increases the performance of the MM model. By relating the proposed approach to the Probabilistic Latent Semantic Analysis (PLSA) model we further propose an extension of the latter in which an extra latent variable allows the model to co-cluster documents and terms simultaneously. We show on these four datasets that the proposed extended version of the PLSA model produces statistically significant improvements with respect to two clustering measures over all variants of the original PLSA and the MM models.