Data squashing: constructing summary data sets

Authors: William DuMouchel
Affiliations: AT&T Labs Research, Florham Park, NJ
Venue: Handbook of massive data sets
Year: 2002

Citing 1

Squashing flat files flatter
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining

Cited 3

Applications of sampling and fractional factorial designs to model-free data squashing
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining

Classification of large data sets with mixture models via sufficient EM
Computational Statistics & Data Analysis

Simple incremental instance selection wrapper for classification
ICAISC'12 Proceedings of the 11th international conference on Artificial Intelligence and Soft Computing - Volume Part II

Abstract

A "large dataset" is here defined as one that cannot be analyzed using some particular desired combination of hardware and software because of computer memory constraints. DuMouchel et al. (1999) defined "data squashing" as the construction of a substitute smaller dataset that leads to approximately the same analysis results as the large dataset. Formally, data squashing is a type of lossy compression that attempts to preserve statistical information. To be efficient, squashing must improve upon the common strategy of taking a random sample from the large dataset. Three recent papers on data squashing are summarized and their results are compared.
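
The contrast between squashing and random sampling can be illustrated with a toy sketch. It is not the construction used in any of the summarized papers, only a minimal example under stated assumptions: one numeric variable, equal-width bins as the regions, and one weighted pseudo-point (the bin mean, weighted by the bin count) per region. The function names `squash_by_bins` and `weighted_mean` are hypothetical. The sketch preserves only the overall mean; it discards within-bin spread, so it is a deliberately simplified instance of a weighted summary data set.

```python
import random
import statistics


def squash_by_bins(values, n_bins):
    """Toy squashing: one (pseudo_point, weight) pair per non-empty equal-width bin.

    The pseudo-point is the bin mean and the weight is the bin count.
    This is an illustrative assumption, not a published algorithm.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal input
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        sums[idx] += v
        counts[idx] += 1
    return [(sums[i] / counts[i], counts[i]) for i in range(n_bins) if counts[i]]


def weighted_mean(points):
    """Mean of the pseudo-points, each counted with its weight."""
    total_weight = sum(w for _, w in points)
    return sum(x * w for x, w in points) / total_weight


rng = random.Random(0)
full = [rng.gauss(10.0, 3.0) for _ in range(100_000)]  # stand-in for the "large dataset"

squashed = squash_by_bins(full, n_bins=50)
sample = rng.sample(full, len(squashed))  # random sample of the same size

full_mean = statistics.fmean(full)
squash_err = abs(weighted_mean(squashed) - full_mean)
sample_err = abs(statistics.fmean(sample) - full_mean)

print(f"rows kept: {len(squashed)} of {len(full)}")
print(f"mean error, squashed summary: {squash_err:.2e}")
print(f"mean error, random sample:    {sample_err:.2e}")
```

Because the weights sum to the original row count and each pseudo-point is a bin mean, the weighted mean reproduces the full-data mean up to floating-point rounding, whereas a random sample of the same size carries ordinary sampling error. Preserving richer statistical information, such as variances or relationships among several variables, would require more than one summary quantity per region.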