Document clustering using linear partitioning hyperplanes and reallocation

  • Authors:
  • Canasai Kruengkrai;Virach Sornlertlamvanich;Hitoshi Isahara

  • Affiliations:
  • Thai Computational Linguistics Laboratory, National Institute of Information and Communications Technology, Pathumthani, Thailand (all three authors)

  • Venue:
  • AIRS'04 Proceedings of the 2004 international conference on Asian Information Retrieval Technology
  • Year:
  • 2004

Abstract

This paper presents a novel document clustering algorithm that combines the Principal Direction Divisive Partitioning (PDDP) algorithm [1] with a simplified version of the EM algorithm called the spherical Gaussian EM (sGEM) algorithm. PDDP recursively splits the data samples into two sub-clusters using the hyperplane normal to the principal direction derived from the covariance matrix. However, PDDP can yield poor results, especially when clusters are not well separated from one another. To improve the quality of the clustering results, we re-allocate cluster memberships using the sGEM algorithm with different settings. Furthermore, building on the theoretical background of the sGEM algorithm, we naturally extend the framework to estimate the number of clusters using the Bayesian Information Criterion. Experimental results on two different corpora show the effectiveness of our algorithm.
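
The two ingredients described in the abstract can be illustrated with a minimal sketch: one PDDP-style split followed by a reallocation pass with EM on spherical Gaussians. This is not the authors' implementation. The function names `pddp_split` and `sgem_reallocate`, the use of one variance per cluster, the fixed iteration count, and the smoothing constant `eps` are all assumptions made for illustration; the paper's exact sGEM settings are not given in the abstract.

```python
import numpy as np


def pddp_split(X):
    """Split the rows of X in two using the hyperplane through the data
    mean that is normal to the principal direction."""
    centered = X - X.mean(axis=0)
    # The leading right singular vector of the centered data equals the
    # principal eigenvector of the sample covariance matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projection = centered @ vt[0]
    return projection > 0  # boolean mask: True for one side, False for the other


def sgem_reallocate(X, labels, n_iter=20, eps=1e-9):
    """Reallocate cluster memberships with EM on spherical Gaussians.

    Assumption: each cluster has its own scalar variance; this is an
    illustrative choice, not necessarily the paper's setting."""
    n, d = X.shape
    k = int(labels.max()) + 1
    resp = np.eye(k)[labels]  # hard initial responsibilities, shape (n, k)
    for _ in range(n_iter):
        # M-step: update mixing weights, means, and per-cluster variances.
        weights = resp.sum(axis=0) + eps
        means = (resp.T @ X) / weights[:, None]
        sq_dist = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        variances = (resp * sq_dist).sum(axis=0) / (d * weights) + eps
        priors = weights / n
        # E-step: posterior responsibilities, computed in log space.
        log_p = (np.log(priors)
                 - 0.5 * d * np.log(2.0 * np.pi * variances)
                 - 0.5 * sq_dist / variances)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp.argmax(axis=1)
```

A typical use would call `pddp_split` on a document-term matrix (rows as documents), convert the mask to integer labels with `mask.astype(int)`, and pass those labels to `sgem_reallocate`. The dense pairwise-distance computation is only practical for small matrices; a real document collection would need a sparse formulation.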