Topic-Based Hard Clustering of Documents Using Generative Models

Authors:
Giovanni Ponti; Andrea Tagarelli
Affiliations:
Dept. of Electronics, Computer and Systems Sciences, University of Calabria, Italy; Dept. of Electronics, Computer and Systems Sciences, University of Calabria, Italy
Venue:
ISMIS '09 Proceedings of the 18th International Symposium on Foundations of Intelligent Systems
Year:
2009

Abstract

In this paper, we describe a framework for clustering documents according to their mixtures of topics. The framework combines the expressiveness of generative models for document representation with a suitably chosen information-theoretic distance measure, and groups the documents via an agglomerative hierarchical clustering scheme. The clustering solution obtained at each level of the dendrogram organizes the documents into sets of topics, yet it is produced without the effort required by a soft/fuzzy clustering method. Experimental results on large, real-world document collections demonstrate the effectiveness of our approach in detecting non-overlapping clusters whose documents share similar mixtures of topics.
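The pipeline outlined in the abstract (topic mixtures as document representations, an information-theoretic distance, agglomerative merging into hard clusters) can be illustrated with a minimal sketch. The abstract does not name the distance measure or the linkage criterion, so the Jensen-Shannon divergence and average linkage used here are assumptions, not necessarily the authors' choices. The topic mixtures in `docs` are hypothetical values standing in for the output of a generative model, and the function names are illustrative only.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed distance): symmetric, in [0, 1] for log base 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def agglomerate(mixtures, k):
    """Merge the closest pair of clusters (average linkage, assumed) until k remain."""
    clusters = [[i] for i in range(len(mixtures))]

    def dist(a, b):
        # Average pairwise divergence between the members of two clusters.
        total = sum(js_divergence(mixtures[i], mixtures[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        x, y = min(pairs, key=lambda xy: dist(clusters[xy[0]], clusters[xy[1]]))
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical per-document topic mixtures over three topics (each row sums to 1).
docs = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
    [0.10, 0.80, 0.10],
]

# Cutting the hierarchy at three clusters yields a hard (non-overlapping) partition.
clusters = agglomerate(docs, 3)
print(clusters)
```

Each document lands in exactly one cluster, even though its representation is a mixture over several topics. Stopping the merge loop at a different `k` corresponds to cutting the dendrogram at a different level.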