Global and Componentwise Extrapolation for Accelerating Data Mining from Large Incomplete Data Sets with the EM Algorithm

  • Authors:
  • Chun-Nan Hsu; Han-Shen Huang; Bo-Hou Yang

  • Affiliations:
  • Academia Sinica, Taiwan; Academia Sinica, Taiwan; Academia Sinica, Taiwan

  • Venue:
  • ICDM '06 Proceedings of the Sixth International Conference on Data Mining
  • Year:
  • 2006

Abstract

The Expectation-Maximization (EM) algorithm is one of the most popular algorithms for data mining from incomplete data. However, when applied to large data sets with a large proportion of missing data, the EM algorithm may converge slowly. The triple jump extrapolation method can effectively accelerate the EM algorithm by substantially reducing the number of iterations required for EM to converge. There are two options for the triple jump method: global extrapolation (TJEM) and componentwise extrapolation (CTJEM). We applied both methods to a variety of probabilistic models and found that, in general, global extrapolation yields better performance, but there are cases where componentwise extrapolation yields very high speedup. In this paper, we investigate when componentwise extrapolation should be preferred. We conclude that CTJEM should be preferred when the Jacobian of the EM mapping is diagonal or block diagonal. We show how to determine whether a Jacobian is diagonal or block diagonal and experimentally confirm our claim. In particular, we show that CTJEM is especially effective for the semi-supervised Bayesian classifier model given a highly sparse data set.
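
The triple jump method builds on Aitken-type (delta-squared) extrapolation of the EM fixed-point iteration. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's exact TJEM/CTJEM update rules: it treats the EM update as a fixed-point mapping M(theta), estimates the contraction rate either globally (one scalar shared by all parameters) or componentwise (one rate per parameter), and jumps toward the estimated fixed point. The function name `extrapolate`, the global-rate formula, and the toy linear mapping `M` with a diagonal "Jacobian" are illustrative assumptions.

```python
import numpy as np

def extrapolate(theta0, theta1, theta2, componentwise=False, eps=1e-12):
    """Aitken-style (delta-squared) jump from three successive EM iterates.

    Global mode estimates one contraction rate shared by all parameters;
    componentwise mode estimates one rate per parameter, which is the
    regime where a (block-)diagonal Jacobian of the EM mapping helps.
    """
    d1, d2 = theta1 - theta0, theta2 - theta1
    if componentwise:
        r = d2 / np.where(np.abs(d1) < eps, eps, d1)   # per-coordinate rate estimate
    else:
        r = np.dot(d1, d2) / max(np.dot(d1, d1), eps)  # single global rate estimate
    r = np.clip(r, -0.999, 0.999)        # keep the geometric-series jump finite
    return theta2 + r / (1.0 - r) * d2   # extrapolate toward the fixed point

# Toy stand-in for an EM mapping: a linear contraction toward theta* = [1, 2, 3]
# whose Jacobian A is diagonal -- the setting in which CTJEM is claimed to shine.
theta_star = np.array([1.0, 2.0, 3.0])
A = np.diag([0.9, 0.5, 0.95])
M = lambda t: theta_star + A @ (t - theta_star)

t0 = np.zeros(3)
t1 = M(t0)
t2 = M(t1)
print(extrapolate(t0, t1, t2, componentwise=True))   # recovers theta_star exactly
print(extrapolate(t0, t1, t2, componentwise=False))  # global rate is only a compromise
```

In this toy run, the componentwise jump lands on theta* exactly because each coordinate's estimated rate equals the corresponding diagonal entry of the Jacobian, while the single global rate must compromise across coordinates with very different contraction speeds. This mirrors, in a simplified setting, why a diagonal or block-diagonal Jacobian of the EM mapping favors CTJEM over TJEM.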