Data squashing: constructing summary data sets

  • Author: William DuMouchel
  • Affiliation: AT&T Labs Research, Florham Park, NJ
  • Venue: Handbook of massive data sets
  • Year: 2002

Abstract

A "large dataset" is here defined as one that cannot be analyzed using some particular desired combination of hardware and software because of computer memory constraints. DuMouchel et al. (1999) defined "data squashing" as the construction of a substitute smaller dataset that leads to approximately the same analysis results as the large dataset. Formally, data squashing is a type of lossy compression that attempts to preserve statistical information. To be efficient, squashing must improve upon the common strategy of taking a random sample from the large dataset. Three recent papers on data squashing are summarized and their results are compared.
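
To make the idea of a substitute dataset concrete, the following is a minimal sketch and not the method of DuMouchel et al. (1999) or of any of the three summarized papers. It assumes a deliberately simple scheme: partition the data into bins along one variable and replace each bin with a single weighted pseudo-point located at the bin means. The names `squash` and `weighted_slope`, the synthetic data, and the choice of 50 bins are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def squash(points, n_bins):
    """Replace (x, y) points with one weighted pseudo-point per x-bin.

    Returns a list of (mean_x, mean_y, weight) rows, where weight is the
    number of original points that fell into the bin. This is a simplified
    illustration, not a published squashing algorithm.
    """
    xs = [x for x, _ in points]
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    bins = defaultdict(list)
    for x, y in points:
        idx = min(int((x - lo) / width), n_bins - 1)  # clamp x == hi
        bins[idx].append((x, y))
    squashed = []
    for idx in sorted(bins):
        members = bins[idx]
        w = len(members)
        mean_x = sum(x for x, _ in members) / w
        mean_y = sum(y for _, y in members) / w
        squashed.append((mean_x, mean_y, w))
    return squashed


def weighted_slope(rows):
    """Weighted least-squares slope of y on x for (x, y, weight) rows."""
    total = sum(w for _, _, w in rows)
    mx = sum(w * x for x, _, w in rows) / total
    my = sum(w * y for _, y, w in rows) / total
    sxy = sum(w * (x - mx) * (y - my) for x, y, w in rows)
    sxx = sum(w * (x - mx) ** 2 for x, _, w in rows)
    return sxy / sxx


# Synthetic "large" dataset (assumed for illustration): y = 2x + noise.
rng = random.Random(0)
full = []
for _ in range(20_000):
    x = rng.uniform(0.0, 10.0)
    full.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

squashed = squash(full, n_bins=50)                     # at most 50 weighted rows
sample = [(x, y, 1) for x, y in rng.sample(full, 50)]  # 50 unweighted rows

full_slope = weighted_slope([(x, y, 1) for x, y in full])
print("full data :", full_slope)
print("squashed  :", weighted_slope(squashed))
print("sample    :", weighted_slope(sample))
```

Both reduced datasets hold at most 50 rows. The squashed rows carry weights, so a weighted analysis uses information from every original point, whereas the random sample discards all but 50 of them. In this toy setting the squashed slope would be expected to track the full-data slope more closely than the sampled slope does, which is the sense in which squashing aims to improve on sampling; realistic methods preserve richer statistical information than bin means.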