Techniques for Warehousing of Sample Data

Authors:
Paul G. Brown;Peter J. Haas
Affiliations:
IBM Almaden Research Center;IBM Almaden Research Center
Venue:
ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
Year:
2006

Citing 0
Cited 15

Online outlier detection in sensor data using non-parametric models

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
A dip in the reservoir: maintaining sample synopses of evolving datasets

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
On synopses for distinct-value estimation under multiset operations

Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Maintaining bernoulli samples over evolving multisets

Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Sampling time-based sliding windows in bounded space

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Sampling-based estimators for subset-based queries

The VLDB Journal — The International Journal on Very Large Data Bases
AMID: Approximation of MultI-measured Data using SVD

Information Sciences: an International Journal
Estimating the confidence of conditional functional dependencies

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Stream warehousing with DataDepot

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Distinct-value synopses for multiset operations

Communications of the ACM - A View of Parallel Computing
Stratified reservoir sampling over heterogeneous data streams

SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
The VC-dimension of SQL queries and selectivity estimation through sampling

ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches

Foundations and Trends in Databases
Adaptive stratified reservoir sampling over heterogeneous data streams

Information Systems
Non-uniformity issues and workarounds in bounded-size sampling

The VLDB Journal — The International Journal on Very Large Data Bases

Quantified Score

Hi-index	0.00

Visualization

Abstract

We consider the problem of maintaining a warehouse of sampled data that "shadows" a full-scale data warehouse, in order to support quick approximate analytics and metadata discovery. The full-scale warehouse comprises many "data sets," where a data set is a bag of values; the data sets can vary enormously in size. The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created by dividing the batch or splitting the stream. We also provide novel methods for merging samples to create a uniform sample from an arbitrary union of data-set partitions. Our sampling/merge methods are the first to simultaneously support statistical uniformity, a priori bounds on the sample footprint, and concise sample storage. As partitions are rolled in and out of the warehouse, the corresponding samples are rolled in and out of the sample warehouse. In this manner our sampling methods approximate the behavior of more sophisticated stream-sampling methods, while also supporting parallel processing. Experiments indicate that our methods are efficient and scalable, and provide guidance for their application.