Feature-preserved sampling over streaming data

Authors:
Kun-Ta Chuang;Hung-Leng Chen;Ming-Syan Chen
Affiliations:
National Taiwan University, Taipei, Taiwan;National Taiwan University, Taipei, Taiwan;National Taiwan University, Taipei, Taiwan
Venue:
ACM Transactions on Knowledge Discovery from Data (TKDD)
Year:
2009

Citing 40
Cited 0

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
Compilers: principles, techniques, and tools

Compilers: principles, techniques, and tools
An efficient algorithm for sequential random sampling

ACM Transactions on Mathematical Software (TOMS)
Error-constrained COUNT query evaluation in relational databases

SIGMOD '91 Proceedings of the 1991 ACM SIGMOD international conference on Management of data
Sequential sampling procedures for query size estimation

SIGMOD '92 Proceedings of the 1992 ACM SIGMOD international conference on Management of data
Query size estimation by adaptive sampling

Selected papers of the 9th annual ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems
An effective hash-based algorithm for mining association rules

SIGMOD '95 Proceedings of the 1995 ACM SIGMOD international conference on Management of data
CURE: an efficient clustering algorithm for large databases

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
New sampling-based summary statistics for improving approximate query answers

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
Efficient progressive sampling

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Data mining: practical machine learning tools and techniques with Java implementations

Data mining: practical machine learning tools and techniques with Java implementations
Density biased sampling: an improved method for data mining and clustering

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Computational strategies for the Riemann zeta function

Journal of Computational and Applied Mathematics - Special issue on numerical analysis in the 20th century vol. 1: approximation theory
Sliding-window filtering: an efficient algorithm for incremental mining

Proceedings of the tenth international conference on Information and knowledge management
Models and issues in data stream systems

Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Sampling from a moving window over streaming data

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Maintaining stream statistics over sliding windows: (extended abstract)

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Modern Information Retrieval

Modern Information Retrieval
Wavelet synopses with error guarantees

Proceedings of the 2002 ACM SIGMOD international conference on Management of data
An efficient and effective algorithm for density biased sampling

Proceedings of the eleventh international conference on Information and knowledge management
Adaptive Sampling Methods for Scaling Up Knowledge Discovery Algorithms

Data Mining and Knowledge Discovery
Dynamic Maintenance of Wavelet-Based Histograms

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
ICICLES: Self-Tuning Samples for Approximate Query Answering

VLDB '00 Proceedings of the 26th International Conference on Very Large Data Bases
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports

Proceedings of the 27th International Conference on Very Large Data Bases
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Maintaining variance and k-medians over data stream windows

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A new two-phase sampling based algorithm for discovering association rules

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Evaluation of sampling for data mining of association rules

RIDE '97 Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97) High Performance Database Management for Large-Scale Applications
Efficient Progressive Sampling for Association Rules

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Efficient Biased Sampling for Approximate Clustering and Outlier Detection in Large Data Sets

IEEE Transactions on Knowledge and Data Engineering
Efficient data reduction with EASE

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Approximate counts and quantiles over sliding windows

PODS '04 Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Sampling algorithms in a stream operator

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
A regression-based temporal pattern mining scheme for data streams

VLDB '03 Proceedings of the 29th international conference on Very large data bases - Volume 29
False positive or false negative: mining frequent itemsets from high speed transactional data streams

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
REHIST: relative error histogram construction algorithms

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
Efficient computation of frequent and top-k elements in data streams

ICDT'05 Proceedings of the 10th international conference on Database Theory
Progressive sampling for association rules based on sampling error estimation

PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

In this article, we explore a novel sampling model, called feature preserved sampling (FPS) that sequentially generates a high-quality sample over sliding windows. The sampling quality we consider refers to the degree of consistency between the sample proportion and the population proportion of each attribute value in a window. Due to the time-variant nature of real-world datasets, users are more likely to be interested in the most recent data. However, previous works have not been able to generate a high-quality sample over sliding windows that precisely preserves up-to-date population characteristics. Motivated by this shortcoming, we have developed the FPS algorithm, which has several advantages: (1) it sequentially generates a sample from a time-variant data source over sliding windows; (2) the execution time of FPS is linear with respect to the database size; (3) the relative proportional differences between the sample proportions and population proportions of most distinct attribute values are guaranteed to be below a specified error threshold, ϵ, while the relative proportion differences of the remaining attribute values are as close to ϵ as possible, which ensures that the generated sample is of high quality; (4) the sample rate is close to the user specified rate so that a high quality sampling result can be obtained without increasing the sample size; (5) by a thorough analytical and empirical study, we prove that FPS has acceptable space overheads, especially when the attribute values have Zipfian distributions, and FPS can also excellently preserve the population proportion of multivariate features in the sample; and (6) FPS can be applied to infinite streams and finite datasets equally, and the generated samples can be used for various applications. Our experiments on both real and synthetic data validate that FPS can effectively obtain a high quality sample of the desired size. In addition, while using the sample generated by FPS in various mining applications, a significant improvement in efficiency can be achieved without compromising the model's precision.