Generative model-based document clustering: a comparative study

Authors:
Shi Zhong; Joydeep Ghosh
Affiliations:
Department of Computer Science and Engineering, Florida Atlantic University, Boca Raton, FL, USA; Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX, USA
Venue:
Knowledge and Information Systems
Year:
2005

Citing 0
Cited 40

Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution

ICML '06 Proceedings of the 23rd international conference on Machine learning
Semi-supervised model-based document clustering: A comparative study

Machine Learning
A spectral clustering approach to optimally combining numerical vectors with a modular network

Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
Short communication: Variable space hidden Markov model for topic detection and analysis

Knowledge-Based Systems
Top-Down Parameter-Free Clustering of High-Dimensional Categorical Data

IEEE Transactions on Knowledge and Data Engineering
Combinational collaborative filtering for personalized community recommendation

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
SAIL: summation-based incremental learning for information-theoretic clustering

Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
Utilizing phrase-similarity measures for detecting and clustering informative RSS news articles

Integrated Computer-Aided Engineering
Document Clustering by Semantic Smoothing and Dynamic Growing Cell Structure (DynGCS) for Biomedical Literature

DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery
External validation measures for K-means clustering: A data distribution perspective

Expert Systems with Applications: An International Journal
Harmony K-means algorithm for document clustering

Data Mining and Knowledge Discovery
Exploiting Wikipedia as external knowledge for document clustering

Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
Field independent probabilistic model for clustering multi-field documents

Information Processing and Management: an International Journal
Topic-Based Hard Clustering of Documents Using Generative Models

ISMIS '09 Proceedings of the 18th International Symposium on Foundations of Intelligent Systems
Semantic smoothing of document models for agglomerative clustering

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
Knowledge transfer on hybrid graph

IJCAI'09 Proceedings of the 21st international joint conference on Artificial intelligence
Multi-grain hierarchical topic extraction algorithm for text mining

Expert Systems with Applications: An International Journal
A probabilistic model for clustering text documents with multiple fields

ECIR'07 Proceedings of the 29th European conference on IR research
Nonnegative Matrix Factorization on Orthogonal Subspace

Pattern Recognition Letters
Hierarchical clustering for topic analysis based on variable feature selection

FSKD'09 Proceedings of the 6th international conference on Fuzzy systems and knowledge discovery - Volume 7
Text stream clustering algorithm based on adaptive feature selection

Expert Systems with Applications: An International Journal
Semantic multi-grain mixture topic model for text analysis

Expert Systems with Applications: An International Journal
Document clustering using synthetic cluster prototypes

Data & Knowledge Engineering
Enhanced clustering of biomedical documents using ensemble non-negative matrix factorization

Information Sciences: an International Journal
Integrating Document Clustering and Multidocument Summarization

ACM Transactions on Knowledge Discovery from Data (TKDD)
Supporting effective health and biomedical information retrieval and navigation: A novel facet view interface evaluation

Journal of Biomedical Informatics
A language-independent approach to identify the named entities in under-resourced languages and clustering multilingual documents

CLEF'11 Proceedings of the Second international conference on Multilingual and multimodal information access evaluation
A statistical model for topically segmented documents

DS'11 Proceedings of the 14th international conference on Discovery science
Topics modeling based on selective Zipf distribution

Expert Systems with Applications: An International Journal
Accelerated multiplicative updates and hierarchical ALS algorithms for nonnegative matrix factorization

Neural Computation
Wikipedia-based smoothing for enhancing text clustering

AIRS'11 Proceedings of the 7th Asia conference on Information Retrieval Technology
Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles

Journal of Intelligent Information Systems
Live and learn from mistakes: A lightweight system for document classification

Information Processing and Management: an International Journal
Measuring the coverage and redundancy of information search services on e-commerce platforms

Electronic Commerce Research and Applications
Discrete-time Hopfield neural network based text clustering algorithm

ICONIP'12 Proceedings of the 19th international conference on Neural Information Processing - Volume Part I
Fuzzy semi-supervised co-clustering for text documents

Fuzzy Sets and Systems
Towards information-theoretic K-means clustering for image indexing

Signal Processing
Document classification using semi-supervised mixture model of von Mises-Fisher distributions on document manifold

Proceedings of the Fourth Symposium on Information and Communication Technology
Document clustering using Dirichlet process mixture model of von Mises-Fisher distributions

Proceedings of the Fourth Symposium on Information and Communication Technology
A continuous characterization of the maximum-edge biclique problem

Journal of Global Optimization

Quantified Score

Hi-index	0.01

Abstract

This paper presents a detailed empirical study of 12 generative approaches to text clustering, obtained by applying four types of document-to-cluster assignment strategies (hard, stochastic, soft and deterministic annealing (DA) based assignments) to each of three base models, namely mixtures of multivariate Bernoulli, multinomial, and von Mises-Fisher (vMF) distributions. A large variety of text collections, both with and without feature selection, are used for the study, which yields several insights, including (a) showing situations wherein the vMF-centric approaches, which are based on directional statistics, fare better than multinomial model-based methods, and (b) quantifying the trade-off between increased performance of the soft and DA assignments and their increased computational demands. We also compare all the model-based algorithms with two state-of-the-art discriminative approaches to document clustering based, respectively, on graph partitioning (CLUTO) and a spectral coclustering method. Overall, DA and CLUTO perform the best but are also the most computationally expensive. The vMF models provide good performance at low cost while the spectral coclustering algorithm fares worse than vMF-based methods for a majority of the datasets.
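
The four assignment strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes uniform cluster priors, takes per-cluster log-likelihoods of one document as given (they would come from whichever base model is used: Bernoulli, multinomial, or vMF), and uses a common temperature-scaled softmax as a stand-in for the DA assignment. All function names and the example values are hypothetical.

```python
import math
import random


def posteriors(log_liks, temperature=1.0):
    """Softmax over per-cluster log-likelihoods, scaled by 1/temperature.

    Assumes uniform cluster priors (an assumption of this sketch).
    """
    scaled = [ll / temperature for ll in log_liks]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def hard_assign(log_liks):
    """Hard assignment: all weight goes to the most likely cluster."""
    best = max(range(len(log_liks)), key=lambda j: log_liks[j])
    return [1.0 if j == best else 0.0 for j in range(len(log_liks))]


def stochastic_assign(log_liks, rng):
    """Stochastic assignment: sample a single cluster from the posterior."""
    probs = posteriors(log_liks)
    chosen = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1.0 if j == chosen else 0.0 for j in range(len(probs))]


def soft_assign(log_liks):
    """Soft assignment: fractional membership equal to the posterior."""
    return posteriors(log_liks)


def da_assign(log_liks, temperature):
    """DA-style assignment: posterior flattened or sharpened by a temperature.

    High temperatures give near-uniform memberships; as the temperature
    approaches zero the result approaches the hard assignment.
    """
    return posteriors(log_liks, temperature)


# Hypothetical log-likelihoods of one document under three clusters.
example_log_liks = [-10.0, -11.0, -14.0]
rng = random.Random(0)

print("hard      ", hard_assign(example_log_liks))
print("stochastic", stochastic_assign(example_log_liks, rng))
print("soft      ", soft_assign(example_log_liks))
print("DA (T=100)", da_assign(example_log_liks, 100.0))
print("DA (T=0.01)", da_assign(example_log_liks, 0.01))
```

In this sketch the soft and DA variants return a full membership vector for every document, which hints at the extra computation the abstract attributes to them: every cluster's parameters are updated from every document, whereas hard and stochastic assignments touch only one cluster per document.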