Model-based clustering is a natural choice for document clustering because it can handle very high dimensionality. However, current model-based algorithms are prone to generating skewed clusters, which seriously degrades clustering quality. This paper examines the causes of skew and identifies three: an inappropriate initial model, a poorly fitting cluster model, and the interaction between the dispersion of the estimation samples and an over-generalized cluster model. It proposes a skew-prevention document-clustering algorithm, MMPClust, which has two features: (1) a content-based cluster model is used to model each cluster better; (2) at the re-estimation step, the subset of documents most relevant to each cluster is selected automatically as that cluster's estimation samples, which breaks this interaction. MMPClust has fewer restrictions and wider applicability in document clustering than previous methods.
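The selective re-estimation idea in feature (2) can be illustrated with a minimal sketch. This is not the MMPClust algorithm itself: the abstract does not specify the cluster model or the selection rule, so the sketch assumes a simple centroid model, cosine similarity as the relevance measure, and a hypothetical `keep_fraction` parameter that controls how many of a cluster's documents serve as estimation samples. All function and parameter names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def reestimate_with_selection(docs, centroids, keep_fraction=0.5):
    """One assignment pass followed by re-estimation on selected documents only.

    docs: list of term-weight vectors; centroids: list of current cluster vectors.
    keep_fraction (assumed, not from the paper): share of each cluster's
    documents, ranked by similarity to the centroid, used for re-estimation.
    Returns (labels, new_centroids).
    """
    # Assignment step: each document goes to its most similar centroid.
    labels, sims = [], []
    for d in docs:
        scores = [cosine(d, c) for c in centroids]
        best = max(range(len(centroids)), key=scores.__getitem__)
        labels.append(best)
        sims.append(scores[best])

    # Re-estimation step: only the most relevant documents update each model.
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == k]
        if not members:
            new_centroids.append(list(old))  # empty cluster: keep its model
            continue
        members.sort(key=lambda i: sims[i], reverse=True)
        n_keep = max(1, math.ceil(keep_fraction * len(members)))
        selected = members[:n_keep]
        new_centroids.append(
            [sum(docs[i][j] for i in selected) / n_keep for j in range(len(old))]
        )
    return labels, new_centroids


docs = [[1, 0], [0.9, 0.1], [0.6, 0.4], [0, 1]]
labels, new_centroids = reestimate_with_selection(docs, [[1, 0], [0, 1]])
```

In this toy run the third document is assigned to the first cluster but is left out of its estimation samples, so the borderline document does not pull the centroid toward the other cluster. That is the intuition behind limiting re-estimation to the most relevant documents, under the assumptions stated above.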