Document clustering based on cluster validation

Authors:
Zheng-Yu Niu;Dong-Hong Ji;Chew-Lim Tan
Affiliations:
Institute for Infocomm Research, Singapore;Institute for Infocomm Research, Singapore;National University of Singapore, Singapore
Venue:
Proceedings of the thirteenth ACM international conference on Information and knowledge management
Year:
2004

Citing 18
Cited 6

Scatter/Gather: a cluster-based approach to browsing large document collections

SIGIR '92 Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval
Floating search methods in feature selection

Pattern Recognition Letters
Hierarchic document classification using Ward's clustering method

Proceedings of the 9th annual international ACM SIGIR conference on Research and development in information retrieval
Projections for efficient document clustering

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval
Web document clustering: a feasibility demonstration

Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval
Document clustering using word clusters via the information bottleneck method

SIGIR '00 Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval
Co-clustering documents and words using bipartite spectral graph partitioning

Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining
Bipartite graph partitioning and data clustering

Proceedings of the tenth international conference on Information and knowledge management
Unsupervised document classification using sequential information maximization

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Document clustering with cluster refinement and model selection capabilities

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Document clustering with committees

SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Model Selection in Unsupervised Learning with Applications To Document Clustering

ICML '99 Proceedings of the Sixteenth International Conference on Machine Learning
Feature Subset Selection and Order Identification for Unsupervised Learning

ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
A Min-max Cut Algorithm for Graph Partitioning and Data Clustering

ICDM '01 Proceedings of the 2001 IEEE International Conference on Data Mining
Feature Weighting in k-Means Clustering

Machine Learning
Document clustering based on non-negative matrix factorization

Proceedings of the 26th annual international ACM SIGIR conference on Research and development in informaion retrieval
Information-theoretic co-clustering

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Resampling Method for Unsupervised Estimation of Cluster Validity

Neural Computation

Document re-ranking using cluster validation and label propagation

CIKM '06 Proceedings of the 15th ACM international conference on Information and knowledge management
Chinese multi-document summarization using adaptive clustering and global search strategy

PRICAI'06 Proceedings of the 9th Pacific Rim international conference on Artificial intelligence
An optimized k-means algorithm of reducing cluster intra-dissimilarity for document clustering

WAIM'05 Proceedings of the 6th international conference on Advances in Web-Age Information Management
Multi-document summarization using a clustering-based hybrid strategy

AIRS'06 Proceedings of the Third Asia conference on Information Retrieval Technology
Finding the optimal cardinality value for information bottleneck method

ADMA'06 Proceedings of the Second international conference on Advanced Data Mining and Applications
Automatic relation extraction with model order selection and discriminative label identification

IJCNLP'05 Proceedings of the Second international joint conference on Natural Language Processing

Quantified Score

Hi-index	0.00

Visualization

Abstract

This paper presents a cluster validation based document clustering algorithm, which is capable of identifying both important feature words and true model order (cluster number). Important feature subset is selected by optimizing a cluster validity criterion subject to some constraint. For achieving model order identification capability, this feature selection procedure is conducted for each possible value of cluster number. The feature subset and cluster number which maximize the cluster validity criterion are chosen as our answer. We have applied our algorithm to several datasets from 20Newsgroup corpus. Experimental results show that our algorithm can find important feature subset, estimate the model order and yield higher micro-averaged precision than other four document clustering algorithms which require cluster number to be provided.