Scalable Model-based Clustering by Working on Data Summaries

  • Authors:
  • Huidong Jin; Man-Leung Wong; Kwong-Sak Leung

  • Venue:
  • ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
  • Year:
  • 2003

Abstract

The scalability problem in data mining involves the development of methods for handling large databases with limited computational resources. In this paper, we present a two-phase scalable model-based clustering framework: first, a large data set is summarized into sub-clusters; then, clusters are generated directly from the summary statistics of the sub-clusters by a specifically designed Expectation-Maximization (EM) algorithm. Taking Gaussian mixture models as an example, we establish a provably convergent EM algorithm, EMADS, which explicitly embodies the cardinality, mean, and covariance information of each sub-cluster. Combined with different data summarization procedures, EMADS is used to construct two clustering systems: gEMADS and bEMADS. The experimental results demonstrate that they run several orders of magnitude faster than the classic EM algorithm with little loss of accuracy. They generate significantly better results than other model-based clustering systems using similar computational resources.
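
The two-phase idea can be illustrated with a minimal sketch. It is not the authors' EMADS: the abstract does not give the update formulas, so everything below is an assumption made for illustration. The function names `summarize_on_grid` and `em_on_summaries` are hypothetical, the regular grid stands in for whatever summarization procedure the paper uses, and the E-step scores each sub-cluster by the average Gaussian log-density of its points (the density at the sub-cluster mean, corrected by a trace term in the sub-cluster covariance). The M-step weights each sub-cluster by its cardinality and adds its internal covariance to the between-mean scatter.

```python
import numpy as np

def summarize_on_grid(X, bins=8):
    """Phase 1 (assumed scheme): bucket points into a regular grid and keep
    (cardinality, mean, covariance) for every non-empty cell."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum(((X - lo) / span * bins).astype(int), bins - 1)
    _, labels = np.unique(cells, axis=0, return_inverse=True)
    labels = labels.ravel()
    counts, means, covs = [], [], []
    for c in range(labels.max() + 1):
        pts = X[labels == c]
        counts.append(len(pts))
        means.append(pts.mean(axis=0))
        dev = pts - means[-1]
        covs.append(dev.T @ dev / len(pts))
    return np.array(counts, dtype=float), np.array(means), np.array(covs)

def em_on_summaries(n, mu, S, k, iters=100, reg=1e-6):
    """Phase 2 (illustrative): EM for a k-component Gaussian mixture that
    only ever touches the sub-cluster summaries (n, mu, S)."""
    M, d = mu.shape
    N = n.sum()
    # Farthest-point initialisation of the component means (an assumption).
    means = [mu[np.argmax(n)]]
    for _ in range(k - 1):
        dist = np.min([np.sum((mu - c) ** 2, axis=1) for c in means], axis=0)
        means.append(mu[np.argmax(dist)])
    means = np.array(means)
    # Every component starts from the covariance of the whole data set,
    # reconstructed from the summaries alone.
    dev0 = mu - (n @ mu) / N
    total = np.einsum('m,mij->ij', n, S + np.einsum('mi,mj->mij', dev0, dev0)) / N
    covs = np.tile(total + reg * np.eye(d), (k, 1, 1))
    weights = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: average log-density of each sub-cluster under each component.
        log_r = np.empty((M, k))
        for j in range(k):
            prec = np.linalg.inv(covs[j])
            dev = mu - means[j]
            maha = np.einsum('mi,ij,mj->m', dev, prec, dev)
            trace = np.einsum('ij,mji->m', prec, S)  # tr(prec @ S_m)
            _, logdet = np.linalg.slogdet(covs[j])
            log_r[:, j] = np.log(weights[j]) - 0.5 * (
                d * np.log(2 * np.pi) + logdet + maha + trace)
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: cardinality-weighted updates.
        w = r * n[:, None]
        Nk = w.sum(axis=0) + 1e-12  # guard against an empty component
        weights = Nk / N
        means = (w.T @ mu) / Nk[:, None]
        for j in range(k):
            dev = mu - means[j]
            scatter = S + np.einsum('mi,mj->mij', dev, dev)
            covs[j] = np.einsum('m,mij->ij', w[:, j], scatter) / Nk[j] + reg * np.eye(d)
    return weights, means, covs
```

Calling `summarize_on_grid(X)` once and then `em_on_summaries(n, mu, S, k)` means each EM iteration costs time proportional to the number of sub-clusters rather than the number of data items, which is the source of the speed-up the abstract describes. A coarser grid gives fewer summaries and faster iterations at the price of a cruder approximation.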