Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
Compilers: principles, techniques, and tools
Compilers: principles, techniques, and tools
An efficient algorithm for sequential random sampling
ACM Transactions on Mathematical Software (TOMS)
Error-constrained COUNT query evaluation in relational databases
SIGMOD '91 Proceedings of the 1991 ACM SIGMOD international conference on Management of data
Sequential sampling procedures for query size estimation
SIGMOD '92 Proceedings of the 1992 ACM SIGMOD international conference on Management of data
Query size estimation by adaptive sampling
Selected papers of the 9th annual ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
An effective hash-based algorithm for mining association rules
SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
CURE: an efficient clustering algorithm for large databases
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
New sampling-based summary statistics for improving approximate query answers
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Efficient progressive sampling
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Data mining: practical machine learning tools and techniques with Java implementations
Data mining: practical machine learning tools and techniques with Java implementations
Density biased sampling: an improved method for data mining and clustering
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Computational strategies for the Riemann zeta function
Journal of Computational and Applied Mathematics - Special issue on numerical analysis in the 20th century vol. 1: approximation theory
Sliding-window filtering: an efficient algorithm for incremental mining
Proceedings of the tenth international conference on Information and knowledge management
Models and issues in data stream systems
Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Sampling from a moving window over streaming data
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Maintaining stream statistics over sliding windows: (extended abstract)
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Modern Information Retrieval
Wavelet synopses with error guarantees
Proceedings of the 2002 ACM SIGMOD international conference on Management of data
An efficient and effective algorithm for density biased sampling
Proceedings of the eleventh international conference on Information and knowledge management
Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms
Data Mining and Knowledge Discovery
Dynamic Maintenance of Wavelet-Based Histograms
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
ICICLES: Self-Tuning Samples for Approximate Query Answering
VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
Proceedings of the 27th International Conference on Very Large Data Bases
Fast Algorithms for Mining Association Rules in Large Databases
VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms
VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Maintaining variance and k-medians over data stream windows
Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A new two-phase sampling based algorithm for discovering association rules
Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Evaluation of sampling for data mining of association rules
RIDE '97 Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97) High Performance Database Management for Large-Scale Applications
Efficient Progressive Sampling for Association Rules
ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets
IEEE Transactions on Knowledge and Data Engineering
Efficient data reduction with EASE
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Approximate counts and quantiles over sliding windows
PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Sampling algorithms in a stream operator
Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Approximate frequency counts over data streams
VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
A regression-based temporal pattern mining scheme for data streams
VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
REHIST: relative error histogram construction algorithms
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Efficient computation of frequent and top-k elements in data streams
ICDT'05 Proceedings of the 10th international conference on Database Theory
Progressive sampling for association rules based on sampling error estimation
PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Hi-index | 0.00 |
In this article, we explore a novel sampling model, called feature preserved sampling (FPS) that sequentially generates a high-quality sample over sliding windows. The sampling quality we consider refers to the degree of consistency between the sample proportion and the population proportion of each attribute value in a window. Due to the time-variant nature of real-world datasets, users are more likely to be interested in the most recent data. However, previous works have not been able to generate a high-quality sample over sliding windows that precisely preserves up-to-date population characteristics. Motivated by this shortcoming, we have developed the FPS algorithm, which has several advantages: (1) it sequentially generates a sample from a time-variant data source over sliding windows; (2) the execution time of FPS is linear with respect to the database size; (3) the relative proportional differences between the sample proportions and population proportions of most distinct attribute values are guaranteed to be below a specified error threshold, ϵ, while the relative proportion differences of the remaining attribute values are as close to ϵ as possible, which ensures that the generated sample is of high quality; (4) the sample rate is close to the user specified rate so that a high quality sampling result can be obtained without increasing the sample size; (5) by a thorough analytical and empirical study, we prove that FPS has acceptable space overheads, especially when the attribute values have Zipfian distributions, and FPS can also excellently preserve the population proportion of multivariate features in the sample; and (6) FPS can be applied to infinite streams and finite datasets equally, and the generated samples can be used for various applications. Our experiments on both real and synthetic data validate that FPS can effectively obtain a high quality sample of the desired size. In addition, while using the sample generated by FPS in various mining applications, a significant improvement in efficiency can be achieved without compromising the model's precision.