Deterministic algorithms for sampling count data

Authors:
Hüseyin Akcan;Alex Astashyn;Hervé Brönnimann
Affiliations:
Computer and Information Science Department, Polytechnic University, Brooklyn, NY 11201, United States;Computer and Information Science Department, Polytechnic University, Brooklyn, NY 11201, United States;Computer and Information Science Department, Polytechnic University, Brooklyn, NY 11201, United States
Venue:
Data & Knowledge Engineering
Year:
2008

Citing 17
Cited 5

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
Mining association rules between sets of items in large databases

SIGMOD '93 Proceedings of the 1993 ACM SIGMOD international conference on Management of data
Online aggregation

SIGMOD '97 Proceedings of the 1997 ACM SIGMOD international conference on Management of data
New sampling-based summary statistics for improving approximate query answers

SIGMOD '98 Proceedings of the 1998 ACM SIGMOD international conference on Management of data
KDD-Cup 2000 organizers' report: peeling the onion

ACM SIGKDD Explorations Newsletter - Special issue on “Scalable data mining algorithms”
Sampling from a moving window over streaming data

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Total

ICDE '96 Proceedings of the Twelfth International Conference on Data Engineering
Sampling Large Databases for Association Rules

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
Fast Incremental Maintenance of Approximate Histograms

VLDB '97 Proceedings of the 23rd International Conference on Very Large Data Bases
Estimating Rarity and Similarity over Data Stream Windows

ESA '02 Proceedings of the 10th Annual European Symposium on Algorithms
A new two-phase sampling based algorithm for discovering association rules

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Evaluation of Sampling for Data Mining of Association Rules

Evaluation of Sampling for Data Mining of Association Rules
Efficient data reduction with EASE

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Sampling algorithms in a stream operator

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Approximation and streaming algorithms for histogram construction problems

ACM Transactions on Database Systems (TODS)
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Learn more, sample less: control of volume and variance in network measurement

IEEE Transactions on Information Theory

Fast UDFs to compute sufficient statistics on large data sets exploiting caching and sampling

Data & Knowledge Engineering
Approximating sliding windows by cyclic tree-like histograms for efficient range queries

Data & Knowledge Engineering
Locality sensitive hashing for sampling-based algorithms in association rule mining

Expert Systems with Applications: An International Journal
ML-DS: a novel deterministic sampling algorithm for association rules mining

ICDM'12 Proceedings of the 12th Industrial conference on Advances in Data Mining: applications and theoretical aspects
Mining frequent items in data stream using time fading model

Information Sciences: an International Journal

Quantified Score

Hi-index	0.00

Visualization

Abstract

Processing and extracting meaningful knowledge from count data is an important problem in data mining. The volume of data is increasing dramatically as the data is generated by day-to-day activities such as market basket data, web clickstream data or network data. Most mining and analysis algorithms require multiple passes over the data, which requires extreme amounts of time. One solution to save time would be to use samples, since sampling is a good surrogate for the data and the same sample can be used to answer many kinds of queries. In this paper, we propose two deterministic sampling algorithms, Biased-L2 and DRS. Both produce samples vastly superior to the previous deterministic and random algorithms, both in sample quality and accuracy. Our algorithms also improve on the run-time and memory footprint of the existing deterministic algorithms. The new algorithms can be used to sample from a relational database as well as data streams, with the ability to examine each transaction only once, and maintain the sample on-the-fly in a streaming fashion. We further show how to engineer one of our algorithms (DRS) to adapt and recover from changes to the underlying data distribution, or sample size. We evaluate our algorithms on three different synthetic datasets, as well as on real-world clickstream data, and demonstrate the improvements over previous art.