KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining

- Applications of sampling and fractional factorial designs to model-free data squashing. Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining.
- Classification of large data sets with mixture models via sufficient EM. Computational Statistics & Data Analysis.
- Simple incremental instance selection wrapper for classification. ICAISC'12 Proceedings of the 11th international conference on Artificial Intelligence and Soft Computing, Volume Part II.
A "large dataset" is defined here as one that cannot be analyzed with a particular desired combination of hardware and software because of computer memory constraints. DuMouchel et al. (1999) defined "data squashing" as the construction of a smaller substitute dataset that leads to approximately the same analysis results as the large dataset. Formally, data squashing is a type of lossy compression that attempts to preserve statistical information. To be worthwhile, squashing must improve upon the common strategy of taking a random sample from the large dataset. Three recent papers on data squashing are summarized and their results are compared.
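The core idea can be illustrated with a toy sketch, which is not any of the algorithms from the papers above: replace groups of similar observations with weighted pseudo-points (here, bin means weighted by bin counts), so that simple statistics of the squashed data reproduce those of the full data. All function names and parameters below are illustrative assumptions.

```python
import random
from collections import defaultdict

def squash(data, n_bins=10):
    """Toy one-dimensional squashing: partition values into equal-width
    bins and replace each bin's points with one pseudo-point, the bin
    mean, weighted by the bin count. (Published squashing methods are
    far more careful, matching higher-order moments or likelihoods.)"""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_bins or 1.0  # guard against zero-width bins
    bins = defaultdict(list)
    for x in data:
        i = min(int((x - lo) / width), n_bins - 1)
        bins[i].append(x)
    return [(sum(v) / len(v), len(v)) for v in bins.values()]

def weighted_mean(pseudo):
    """Mean of the squashed data, using the pseudo-point weights."""
    total = sum(w for _, w in pseudo)
    return sum(x * w for x, w in pseudo) / total

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(10_000)]
pseudo = squash(data)

# The weighted mean of the pseudo-points reproduces the full-data mean
# exactly (up to floating-point rounding), with at most n_bins rows
# instead of 10,000.
full_mean = sum(data) / len(data)
```

A random sample of the same size (ten points) would estimate the mean with noticeable sampling error, which is the sense in which a squashed dataset must beat a random sample of equal size to be useful.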