Document clustering via Dirichlet process mixture model with feature selection

Authors:
Guan Yu;Ruizhang Huang;Zhaojun Wang
Affiliations:
The Hong Kong Polytechnic University, Hong Kong, Hong Kong;The Hong Kong Polytechnic University, Hong Kong, Hong Kong;Nankai University, Tianjin, China
Venue:
Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining
Year:
2010

Abstract

One essential issue in document clustering is estimating the appropriate number of clusters into which a document collection should be partitioned. In this paper, we propose a novel approach, namely DPMFS, to address this issue. The proposed approach is designed 1) to group documents into a set of clusters, with the number of clusters determined automatically by the Dirichlet process mixture model; and 2) to identify the discriminative words and separate them from irrelevant noise words via a stochastic search variable selection technique. We evaluate the proposed approach on both a synthetic dataset and several realistic document datasets. The comparison between our approach and state-of-the-art document clustering approaches indicates that our approach is robust and effective for document clustering.
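
To illustrate why a Dirichlet process prior lets the number of clusters be inferred rather than fixed in advance, the sketch below draws cluster assignments from a Chinese restaurant process, the standard sequential view of that prior. It is not the DPMFS inference procedure from the paper: it ignores document content and feature selection entirely, and the function name `crp_assignments` and the concentration parameter `alpha` are assumptions introduced only for this example.

```python
import random


def crp_assignments(n_docs, alpha, seed=0):
    """Sample cluster labels for n_docs items from a Chinese restaurant process.

    Hypothetical helper for illustration only: it samples from the prior and
    does not look at any document content.
    """
    rng = random.Random(seed)
    counts = []       # counts[k] = number of documents already in cluster k
    assignments = []  # assignments[i] = cluster label of document i
    for _ in range(n_docs):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is opened with weight alpha (must be > 0).
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


if __name__ == "__main__":
    for alpha in (0.5, 2.0, 10.0):
        _, counts = crp_assignments(500, alpha, seed=1)
        print(f"alpha={alpha}: {len(counts)} clusters")
```

Under this prior the number of occupied clusters is not a fixed input; it tends to grow slowly with the number of documents and with larger `alpha`. In a full mixture model, each assignment would additionally be weighted by how well the document fits the cluster.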