In this paper, we describe a framework for clustering documents according to their mixtures of topics. The framework combines the expressiveness of generative models for document representation with a suitably chosen information-theoretic distance measure, and groups the documents with an agglomerative hierarchical clustering scheme. The clustering solution obtained at each level of the dendrogram organizes the documents into sets of topics, yet it is produced without the overhead of a soft (fuzzy) clustering method. Experimental results on large, real-world document collections demonstrate that the approach is effective at detecting non-overlapping clusters whose documents share similar mixtures of topics.
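The pipeline described above can be illustrated with a minimal sketch. Everything concrete in it is an assumption made for illustration: the abstract does not name the distance measure or the linkage criterion, so the sketch uses the Jensen-Shannon divergence and average linkage as stand-ins, and the topic mixtures are hand-written rather than estimated by a generative model. The function names `js_divergence` and `agglomerate` are hypothetical.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Assumption: this stands in for the paper's unspecified
    information-theoretic distance measure.
    """
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def agglomerate(topic_mixtures, num_clusters):
    """Merge the two closest clusters until `num_clusters` remain.

    Assumption: average linkage; the source does not state the criterion.
    Returns a list of clusters, each a sorted list of document indices.
    """
    clusters = [[i] for i in range(len(topic_mixtures))]

    def linkage(a, b):
        total = sum(
            js_divergence(topic_mixtures[i], topic_mixtures[j])
            for i in a
            for j in b
        )
        return total / (len(a) * len(b))

    while len(clusters) > num_clusters:
        # Find the pair of clusters with the smallest average divergence.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


# Hypothetical per-document topic mixtures over three topics.
docs = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
]
hard_clusters = agglomerate(docs, 2)
print(hard_clusters)
```

Each document is assigned to exactly one cluster, so the output is a hard partition even though every document is represented by a soft mixture of topics. Stopping the merge loop at different values of `num_clusters` corresponds to cutting the dendrogram at different levels.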