Document clustering is essential for information retrieval, Web data mining, and Web data management. Because it copes with the very high dimensionality and sparsity of document features, model-based clustering has proved to be an intuitive choice for document clustering. However, current model-based algorithms are prone to generating skewed clusters, which seriously degrade clustering quality. In this paper, the causes of skew are examined and identified as an inappropriate initial model and the interaction between the decentralization of estimation samples and an over-generalized cluster model. An effective clustering skew prevention method (ESPClust) is proposed to address the latter cause. To break this interaction, ESPClust automatically selects, for each cluster, the subset of documents most relevant to its corresponding class and uses them as the estimation samples for re-estimating the cluster model. Based on ESPClust, two algorithms are provided, targeting quality and efficiency respectively, for different kinds of applications. Compared with balanced model-based algorithms, ESPClust imposes fewer restrictions and is more widely applicable. Experiments show that ESPClust largely avoids clustering skew and achieves a higher Macro-F1 than previous methods.
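The abstract does not give the details of ESPClust, so the following is only a minimal sketch of the general idea it describes: re-estimating each cluster model from the documents most relevant to that cluster rather than from all of its members. Everything concrete here is an assumption made for illustration, not the authors' method: the multinomial model with Laplace smoothing, the length-normalized log-likelihood used as the relevance score, the fixed `keep_ratio`, and all function names.

```python
import math

def estimate_model(docs, vocab_size, smoothing=1.0):
    """Laplace-smoothed multinomial model: returns log P(word | cluster)."""
    counts = [smoothing] * vocab_size
    for doc in docs:
        for word, count in enumerate(doc):
            counts[word] += count
    total = sum(counts)
    return [math.log(c / total) for c in counts]

def relevance(doc, model):
    """Average per-word log-likelihood of a document under a model (assumed score)."""
    length = sum(doc)
    if length == 0:
        return float("-inf")
    return sum(c * lp for c, lp in zip(doc, model)) / length

def select_estimation_samples(docs, model, keep_ratio):
    """Keep the fraction of documents that the model scores as most relevant."""
    ranked = sorted(docs, key=lambda d: relevance(d, model), reverse=True)
    keep = max(1, math.ceil(keep_ratio * len(ranked)))
    return ranked[:keep]

def reestimate_step(docs, assignments, k, vocab_size, keep_ratio=0.5):
    """One iteration: re-estimate each model from selected samples, then reassign."""
    models = []
    for j in range(k):
        members = [d for d, a in zip(docs, assignments) if a == j]
        if not members:
            # Empty cluster: fall back to the uniform smoothed model.
            models.append(estimate_model([], vocab_size))
            continue
        full_model = estimate_model(members, vocab_size)
        core = select_estimation_samples(members, full_model, keep_ratio)
        models.append(estimate_model(core, vocab_size))
    new_assignments = [
        max(range(k), key=lambda j: relevance(d, models[j])) for d in docs
    ]
    return models, new_assignments

# Toy term-count vectors over a 4-word vocabulary (hypothetical data).
docs = [[5, 0, 0, 0], [4, 1, 0, 0], [0, 0, 5, 0], [0, 0, 4, 1], [3, 0, 2, 0]]
models, assignments = reestimate_step(docs, [0, 0, 1, 1, 0], k=2, vocab_size=4)
```

In this sketch, a mixed document such as `[3, 0, 2, 0]` is excluded from the estimation samples of its cluster, so it does not broaden the cluster model, although it can still be assigned to that cluster afterwards.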