ESPClust: an effective skew prevention method for model-based document clustering

Authors:
Xiaoguang Li;Ge Yu;Daling Wang;Yubin Bao
Affiliations:
School of Information Science and Engineering, Northeastern University, Shenyang, P.R. China (all authors)
Venue:
CICLing'05 Proceedings of the 6th international conference on Computational Linguistics and Intelligent Text Processing
Year:
2005



Abstract

Document clustering is necessary for information retrieval, Web data mining, and Web data management. Because it copes with the very high dimensionality and sparsity of document features, model-based clustering has proved to be an intuitive choice for document clustering. However, current model-based algorithms are prone to generating skewed clusters, which seriously degrade clustering quality. In this paper, the causes of skew are examined and identified as an inappropriate initial model and the interaction between the decentralization of estimation samples and an over-generalized cluster model. An effective clustering skew prevention method (ESPClust) is proposed that focuses on the latter cause. To break this interaction, for each cluster, ESPClust automatically selects the subset of documents most relevant to its corresponding class as the estimation samples used to re-estimate the cluster model. Based on ESPClust, two algorithms, one oriented toward quality and one toward efficiency, are provided for different kinds of applications. Compared with balanced model-based algorithms, the ESPClust method has fewer restrictions and wider applicability. Experiments show that ESPClust avoids clustering skew to a great degree and that its Macro-F1 measure outperforms that of previous methods.
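
The idea of re-estimating each cluster model from only its most relevant documents can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the cluster model, the relevance measure, or how the selection is made automatically. The sketch assumes a centroid model, cosine similarity as the relevance score, and a fixed `keep_fraction` as a stand-in for the automatic selection; all function and parameter names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def reestimate_with_selection(docs, assignments, centroids, keep_fraction=0.5):
    """One re-estimation step over all clusters.

    Each centroid is recomputed from only the members most similar to the
    current centroid. ASSUMPTION: a fixed keep_fraction replaces the
    automatic selection described in the abstract.
    """
    new_centroids = []
    for k, centroid in enumerate(centroids):
        members = [d for d, a in zip(docs, assignments) if a == k]
        if not members:
            # Empty cluster: keep the previous model unchanged.
            new_centroids.append(list(centroid))
            continue
        # Rank members by relevance to the current cluster model.
        members.sort(key=lambda d: cosine(d, centroid), reverse=True)
        n_keep = max(1, math.ceil(keep_fraction * len(members)))
        new_centroids.append(mean_vector(members[:n_keep]))
    return new_centroids


# Toy data: the third document sits between the two clusters and is
# excluded from the estimation samples of cluster 0.
docs = [[1.0, 0.0], [0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
assignments = [0, 0, 0, 1]
centroids = [[1.0, 0.0], [0.0, 1.0]]
updated = reestimate_with_selection(docs, assignments, centroids, keep_fraction=0.5)
```

In this toy run the borderline document `[0.5, 0.5]` no longer pulls the first centroid toward the second cluster, which is the kind of over-generalization the abstract attributes to decentralized estimation samples.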