Document clustering with cluster refinement and model selection capabilities

Authors:
Xin Liu;Yihong Gong;Wei Xu;Shenghuo Zhu
Affiliations:
NEC USA, Inc, Cupertino, CA;NEC USA, Inc, Cupertino, CA;NEC USA, Inc, Cupertino, CA;University of Rochester, Rochester, NY
Venue:
SIGIR '02 Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval
Year:
2002

Abstract

In this paper, we propose a document clustering method that strives to achieve (1) high document clustering accuracy and (2) the capability of estimating the number of clusters in the document corpus (i.e., the model selection capability). To cluster the given document corpus accurately, we employ a richer feature set to represent each document, and use the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm to conduct an initial document clustering. From this initial result, we identify a set of discriminative features for each cluster, and refine the initially obtained document clusters by voting on the cluster label of each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is applied iteratively until the document clusters converge. The model selection capability, in turn, is achieved by introducing randomness in the cluster initialization stage, and then discovering a value C for the number of clusters N at which running the document clustering process a fixed number of times yields sufficiently similar results. Performance evaluations show clear superiority of the proposed method in both document clustering accuracy and model selection accuracy. The evaluations also demonstrate how each feature, as well as the cluster refinement process, contributes to the document clustering accuracy.
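
The stability-based model selection idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation and rests on several assumptions. Plain k-means with random initial centers stands in for the GMM + EM clustering. The Rand index is an assumed measure of how similar two clustering results are. All function names (`kmeans`, `rand_index`, `stability`, `select_num_clusters`) are hypothetical. For each candidate number of clusters, the clustering is run a fixed number of times from random initializations, and the candidate whose runs agree most with one another is selected.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=20):
    """Plain k-means with random initial centers.

    Assumption: this is a simple stand-in for the GMM + EM clustering step.
    """
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Move each non-empty cluster's center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def rand_index(a, b):
    """Fraction of point pairs that two labelings treat the same way.

    Assumption: the Rand index is used here only as an example similarity
    measure; it does not depend on how cluster labels are numbered.
    """
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, runs, rng):
    """Mean pairwise agreement between `runs` randomly initialized clusterings."""
    results = [kmeans(points, k, rng) for _ in range(runs)]
    scores = [rand_index(x, y) for x, y in combinations(results, 2)]
    return sum(scores) / len(scores)


def select_num_clusters(points, candidates, runs=5, seed=0):
    """Return the candidate k with the most stable results, plus all scores."""
    rng = random.Random(seed)
    scores = {k: stability(points, k, runs, rng) for k in candidates}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Toy data: two tight, well-separated groups of 2-D points.
    points = ([(0.1 * i, 0.0) for i in range(10)]
              + [(100 + 0.1 * i, 0.0) for i in range(10)])
    best, scores = select_num_clusters(points, candidates=[2, 3, 4])
    print(best, scores)
```

Candidates should start at 2 in this sketch, because a single cluster is trivially stable under any initialization. On the toy data, every run with two clusters recovers the same two groups, so that candidate reaches the maximum possible agreement and is selected.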