ESPClust: an effective skew prevention method for model-based document clustering

  • Authors:
  • Xiaoguang Li; Ge Yu; Daling Wang; Yubin Bao

  • Affiliations:
  • School of Information Science and Engineering, Northeastern University, Shenyang, P.R. China (all four authors)

  • Venue:
  • CICLing'05: Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Text Processing
  • Year:
  • 2005

Abstract

Document clustering is necessary for information retrieval, Web data mining, and Web data management. Because it copes with the very high dimensionality and sparsity of document features, model-based clustering has proved to be an intuitive choice for document clustering. However, current model-based algorithms are prone to generating skewed clusters, which seriously degrade clustering quality. In this paper, the causes of skew are examined and identified as an inappropriate initial model and the interaction between the decentralization of estimation samples and an over-generalized cluster model. An effective clustering skew prevention method (ESPClust) is proposed to address the latter cause. To break this interaction, for each cluster ESPClust automatically selects the subset of documents most relevant to its corresponding class as the estimation samples for re-estimating the cluster model. Based on ESPClust, two algorithms, one oriented toward quality and one toward efficiency, are provided for different kinds of applications. Compared with balanced model-based algorithms, ESPClust imposes fewer restrictions and is more widely applicable. Experiments show that ESPClust avoids clustering skew to a great degree and that its Macro-F1 measure exceeds that of previous methods.
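
The idea of re-estimating each cluster model from only its most relevant documents can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the model family, the relevance score, or how the selected portion is determined (ESPClust selects it automatically). The sketch assumes a Laplace-smoothed multinomial model over word ids, length-normalised log-likelihood as the relevance score, and a fixed fraction `keep`; all function and parameter names are hypothetical.

```python
import math
from collections import Counter


def estimate_model(docs, vocab_size, alpha=1.0):
    """Laplace-smoothed multinomial over word ids 0..vocab_size-1 (assumed model)."""
    counts = Counter()
    for doc in docs:
        counts.update(doc)
    total = sum(counts.values()) + alpha * vocab_size
    return [(counts.get(w, 0) + alpha) / total for w in range(vocab_size)]


def log_likelihood(doc, model):
    """Log-likelihood of a document (list of word ids) under a multinomial model."""
    return sum(math.log(model[w]) for w in doc)


def cluster_with_selection(docs, init_models, vocab_size, keep=0.5, iters=10):
    """Hard-assignment model-based clustering in which each cluster model is
    re-estimated from only the top `keep` fraction of its assigned documents.
    The fixed fraction is an assumption made for illustration."""
    models = [list(m) for m in init_models]
    assign = []
    for _ in range(iters):
        # Assignment step: each document goes to its most likely cluster.
        scored = [[] for _ in models]
        assign = []
        for i, doc in enumerate(docs):
            lls = [log_likelihood(doc, m) for m in models]
            k = max(range(len(models)), key=lambda j: lls[j])
            assign.append(k)
            # Relevance score: per-token log-likelihood under the chosen model.
            scored[k].append((lls[k] / max(len(doc), 1), i))
        # Re-estimation step: use only the most relevant documents per cluster.
        for k, members in enumerate(scored):
            if not members:
                continue  # an empty cluster keeps its previous model
            members.sort(reverse=True)
            n_keep = max(1, math.ceil(keep * len(members)))
            selected = [docs[i] for _, i in members[:n_keep]]
            models[k] = estimate_model(selected, vocab_size)
    # `assign` reflects the last assignment step; `models` the last re-estimation.
    return assign, models
```

With `keep=1.0` the sketch reduces to ordinary hard-assignment model-based clustering, so lowering `keep` shows how restricting the estimation samples changes the resulting cluster sizes.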