Applications of sampling and fractional factorial designs to model-free data squashing

  • Authors:
  • William DuMouchel;Deepak K. Agarwal

  • Affiliations:
  • AT&T Labs Research, Florham Park, NJ;AT&T Labs Research, Florham Park, NJ

  • Venue:
  • Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
  • Year:
  • 2003

Quantified Score

Hi-index 0.00

Visualization

Abstract

The concept of "data squashing" was introduced by DuMouchel et al [4] as a method of summarizing massive data sets that preserves statistical relationships among variables. The idea is to create a smaller data set that allows statistical modeling to take place using in-memory algorithms, and to preserve the modeling results more accurately than would a same-size random sample from the massive data set. This research attempts to avoid several limitations of previous approaches to data squashing. Our method avoids the curse of dimensionality by a double use of principal components transformations that makes computing time linear in the number of cases and quadratic in the number of variables. Categorical and continuous variables are smoothly integrated. Because the binning is based on principal components, which are uncorrelated, we can use fractional factorial designs that sample less than one point per bin. We also investigate various weighting schemes for the squashed sample to see whether matching moments or matching subregion data counts is more effective. Finally, previous work required the specification of a statistical model, either to perform the squashing algorithm or to compare the worth of different squashing methods. Our approach to evaluation is model free and does not even require the specification of variables as responses or predictors. Instead, we develop a chi-squared like measure of accuracy to compare the closeness of various discrete densities (the squashed data sets) to the discrete massive data set.