MMPClust: a skew prevention algorithm for model-based document clustering

Authors:
Xiaoguang Li;Ge Yu;Daling Wang
Affiliations:
School of Information Science and Engineering, Northeastern University, Shenyang, P.R.China;School of Information Science and Engineering, Northeastern University, Shenyang, P.R.China;School of Information Science and Engineering, Northeastern University, Shenyang, P.R.China
Venue:
DASFAA'05 Proceedings of the 10th international conference on Database Systems for Advanced Applications
Year:
2005

Abstract

Because it can cope with very high dimensionality, model-based clustering is an intuitive choice for document clustering. However, current model-based algorithms are prone to generating skewed clusters, which seriously degrade clustering quality. In this paper, the causes of skew are examined and identified as an inappropriate initial model, a poorly fitting cluster model, and the interaction between the decentralization of estimation samples and an over-generalized cluster model. This paper proposes a skew-prevention document-clustering algorithm (MMPClust), which has two features: (1) a content-based cluster model is used to model each cluster better; (2) at the re-estimation step, a subset of the documents most relevant to each cluster's corresponding class is selected automatically as that cluster's estimation samples, which breaks this interaction. MMPClust has fewer restrictions and is more widely applicable to document clustering than previous methods.
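
The re-estimation idea in feature (2) can be illustrated with a minimal sketch. It is not the MMPClust algorithm itself. The abstract does not specify the cluster model or the selection rule, so the sketch assumes a Laplace-smoothed multinomial model over term counts, length-normalized log-likelihood as the relevance score, and a fixed `keep_fraction` in place of the paper's automatic selection. All function and parameter names are hypothetical.

```python
import math

def estimate_model(docs, vocab_size, alpha=1.0):
    """Laplace-smoothed multinomial term distribution, as log-probabilities."""
    totals = [alpha] * vocab_size
    for doc in docs:
        for term, count in enumerate(doc):
            totals[term] += count
    norm = sum(totals)
    return [math.log(x / norm) for x in totals]

def relevance(doc, log_theta):
    """Average per-term log-likelihood of a term-count vector (assumed score)."""
    length = max(1, sum(doc))
    return sum(c * log_theta[t] for t, c in enumerate(doc)) / length

def reestimate(docs, models, keep_fraction=0.5):
    """Assign each document to its best model, then re-estimate every model
    from only the most relevant fraction of its members.

    keep_fraction is an assumption of this sketch; the paper selects the
    estimation samples automatically.
    """
    vocab_size = len(docs[0])
    members = [[] for _ in models]
    for doc in docs:
        scores = [relevance(doc, m) for m in models]
        best = max(range(len(models)), key=scores.__getitem__)
        members[best].append((scores[best], doc))

    new_models, selected = [], []
    for k, scored in enumerate(members):
        if not scored:
            # An empty cluster keeps its previous model unchanged.
            new_models.append(models[k])
            selected.append([])
            continue
        scored.sort(key=lambda pair: pair[0], reverse=True)
        n_keep = max(1, math.ceil(keep_fraction * len(scored)))
        kept = [doc for _, doc in scored[:n_keep]]
        new_models.append(estimate_model(kept, vocab_size))
        selected.append(kept)
    return new_models, selected
```

Under these assumptions, a document that sits between two topics is still assigned to a cluster but is left out of that cluster's parameter estimate, so it cannot pull the model toward an over-generalized term distribution.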