Random sampling with a reservoir
ACM Transactions on Mathematical Software (TOMS)
An efficient algorithm for sequential random sampling
ACM Transactions on Mathematical Software (TOMS)
BIRCH: an efficient data clustering method for very large databases
SIGMOD '96 Proceedings of the 1996 ACM SIGMOD international conference on Management of data
Incremental clustering and dynamic information retrieval
STOC '97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing
CURE: an efficient clustering algorithm for large databases
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Automatic subspace clustering of high dimensional data for data mining applications
SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Efficient progressive sampling
KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Data mining: practical machine learning tools and techniques with Java implementations
Data mining: practical machine learning tools and techniques with Java implementations
Density biased sampling: an improved method for data mining and clustering
SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Data mining: concepts and techniques
Data mining: concepts and techniques
Sliding-window filtering: an efficient algorithm for incremental mining
Proceedings of the tenth international conference on Information and knowledge management
Sampling from a moving window over streaming data
SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules
Data Mining and Knowledge Discovery
Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms
Data Mining and Knowledge Discovery
Knowledge Acquisition Via Incremental Conceptual Clustering
Machine Learning
An Incremental Hierarchical Data Clustering Algorithm Based on Gravity Theory
PAKDD '02 Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining
Evaluation of sampling for data mining of association rules
RIDE '97 Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97) High Performance Database Management for Large-Scale Applications
Efficient data reduction with EASE
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining concept-drifting data streams using ensemble classifiers
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding recent frequent itemsets adaptively over online data streams
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Probabilistic wavelet synopses
ACM Transactions on Database Systems (TODS)
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
REHIST: relative error histogram construction algorithms
VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Efficient computation of frequent and top-k elements in data streams
ICDT'05 Proceedings of the 10th international conference on Database Theory
Hi-index | 0.00 |
We explore in this paper a novel sampling algorithm, referred to as algorithm PAS (standing for Proportion Approximation Sampling), to generate a high-quality online sample with the desired sample rate. The sampling quality refers to the consistency between the population proportion and the sample proportion of each categorical value in the database. Note that the state-of-the-art sampling algorithm to preserve the sampling quality has to examine the population proportion of each categorical value in a pilot sample a priori and is thus not applicable to incremental mining applications. To remedy this, algorithm PAS adaptively determines the inclusion probability of each incoming tuple in such a way that the sampling quality can be sequentially preserved while also guaranteeing the sample rate close to the user specified one. Importantly, PAS not only guarantees the proportion consistency of each categorical value but also excellently preserves the proportion consistency of multivariate statistics, which will be significantly beneficial to various data mining applications. For better execution efficiency, we further devise an algorithm, called algorithm EQAS (standing for Efficient Quality-Aware Sampling), which integrates PAS and random sampling to provide the flexibility of striking a compromise between the sampling quality and the sampling efficiency. As validated in experimental results on real and synthetic data, algorithm PAS can stably provide high-quality samples with corresponding computational overhead, whereas algorithm EQAS can flexibly generate samples with the desired balance between sampling quality and sampling efficiency. In addition, while applying the sample generated by algorithms PAS and EQAS to incremental mining applications, a significant efficiency improvement can be obtained without compromising the resulting precision, showing the prominent advantage of both proposed algorithms to be the quality-aware sampling means for incremental mining applications.