Online outlier detection in sensor data using non-parametric models
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
A dip in the reservoir: maintaining sample synopses of evolving datasets
VLDB '06 Proceedings of the 32nd international conference on Very large data bases
On synopses for distinct-value estimation under multiset operations
Proceedings of the 2007 ACM SIGMOD international conference on Management of data
Maintaining bernoulli samples over evolving multisets
Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Sampling time-based sliding windows in bounded space
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
Sampling-based estimators for subset-based queries
The VLDB Journal — The International Journal on Very Large Data Bases
AMID: Approximation of MultI-measured Data using SVD
Information Sciences: an International Journal
Estimating the confidence of conditional functional dependencies
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Stream warehousing with DataDepot
Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Distinct-value synopses for multiset operations
Communications of the ACM - A View of Parallel Computing
Stratified reservoir sampling over heterogeneous data streams
SSDBM'10 Proceedings of the 22nd international conference on Scientific and statistical database management
The VC-dimension of SQL queries and selectivity estimation through sampling
ECML PKDD'11 Proceedings of the 2011 European conference on Machine learning and knowledge discovery in databases - Volume Part II
Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches
Foundations and Trends in Databases
Adaptive stratified reservoir sampling over heterogeneous data streams
Information Systems
Non-uniformity issues and workarounds in bounded-size sampling
The VLDB Journal — The International Journal on Very Large Data Bases
Hi-index | 0.00 |
We consider the problem of maintaining a warehouse of sampled data that "shadows" a full-scale data warehouse, in order to support quick approximate analytics and metadata discovery. The full-scale warehouse comprises many "data sets," where a data set is a bag of values; the data sets can vary enormously in size. The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created by dividing the batch or splitting the stream. We also provide novel methods for merging samples to create a uniform sample from an arbitrary union of data-set partitions. Our sampling/merge methods are the first to simultaneously support statistical uniformity, a priori bounds on the sample footprint, and concise sample storage. As partitions are rolled in and out of the warehouse, the corresponding samples are rolled in and out of the sample warehouse. In this manner our sampling methods approximate the behavior of more sophisticated stream-sampling methods, while also supporting parallel processing. Experiments indicate that our methods are efficient and scalable, and provide guidance for their application.