Topic-Based Hard Clustering of Documents Using Generative Models

  • Authors: Giovanni Ponti; Andrea Tagarelli

  • Affiliations: Dept. of Electronics, Computer and Systems Sciences, University of Calabria, Italy (both authors)

  • Venue: ISMIS '09: Proceedings of the 18th International Symposium on Foundations of Intelligent Systems
  • Year: 2009

Abstract

In this paper, we describe a framework for clustering documents according to their mixtures of topics. The framework combines the expressiveness of generative models for document representation with a suitably chosen information-theoretic distance measure, and groups the documents via an agglomerative hierarchical clustering scheme. The clustering solution obtained at each level of the dendrogram reflects an organization of the documents into sets of topics, yet it is produced without the effort required by a soft/fuzzy clustering method. Experimental results on large, real-world document collections demonstrate the effectiveness of our approach in detecting non-overlapping clusters whose documents share similar mixtures of topics.
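
The pipeline outlined in the abstract can be illustrated with a minimal sketch. The abstract does not name the generative model, the distance measure, or the linkage criterion, so everything specific here is an assumption: each document is taken to be already represented as a topic mixture (a probability vector, as a topic model might produce), the Jensen-Shannon divergence stands in for the information-theoretic distance, and average linkage drives the merging. The function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (assumed distance, bounded in [0, 1] bits)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def agglomerative(mixtures, n_clusters):
    """Average-linkage agglomerative clustering over topic mixtures.

    Merges the two closest clusters until n_clusters remain, which
    corresponds to cutting the dendrogram at one level. Returns a sorted
    list of sorted document-index lists (a hard, non-overlapping partition).
    """
    n = len(mixtures)
    # Pairwise document distances, keyed by (smaller index, larger index).
    dist = {
        (i, j): js_divergence(mixtures[i], mixtures[j])
        for i, j in combinations(range(n), 2)
    }

    def linkage(a, b):
        # Average of all cross-cluster document distances.
        total = sum(dist[(min(i, j), max(i, j))] for i in a for j in b)
        return total / (len(a) * len(b))

    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: linkage(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy topic mixtures over three topics (illustrative values only).
mixtures = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.1, 0.8],
]
partition = agglomerative(mixtures, 3)
```

Each document ends up in exactly one cluster, so the result is a hard partition even though every document is itself described by a soft mixture of topics. Calling `agglomerative` with different values of `n_clusters` corresponds to reading the dendrogram at different levels.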