Which Is Better for Frequent Pattern Mining: Approximate Counting or Sampling?

Authors:
Willie Ng;Manoranjan Dash
Affiliations:
Centre for Advanced Information Systems, Nanyang Technological University, Singapore 639798;Centre for Advanced Information Systems, Nanyang Technological University, Singapore 639798
Venue:
DaWaK '09 Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery
Year:
2009

Citing 16
Cited 0

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Sampling Large Databases for Association Rules

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
A simple algorithm for finding frequent elements in streams and bags

ACM Transactions on Database Systems (TODS)
A new two-phase sampling based algorithm for discovering association rules

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Evaluation of sampling for data mining of association rules

RIDE '97 Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97) High Performance Database Management for Large-Scale Applications
Probabilistic Noise Identification and Data Cleaning

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Efficient data reduction with EASE

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
False positive or false negative: mining frequent itemsets from high speed transactional data streams

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
A survey on algorithms for mining frequent itemsets over data streams

Knowledge and Information Systems
Efficient Approximate Mining of Frequent Patterns over Transactional Data Streams

DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery
Mining Frequent Itemsets in a Stream

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
An empirical study of the noise impact on cost-sensitive learning

IJCAI'07 Proceedings of the 20th international joint conference on Artifical intelligence
Maintaining frequent itemsets over high-speed data streams

PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining
Progressive sampling for association rules based on sampling error estimation

PAKDD'05 Proceedings of the 9th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining

Quantified Score

Hi-index	0.00

Visualization

Abstract

We investigate the problem of finding frequent patterns in a continuous stream of transactions. In the literature two prominent approaches are often used: (a) perform approximate counting (e.g., lossy counting algorithm (LCA) of Manku and Motwani, VLDB 2002) by using a lower support threshold than the one given by the user, or (b) maintain a running sample (e.g., reservoir sampling (Algo-Z) of Vitter, TOMS 1985) and generate frequent itemsets from the sample on demand. Both approaches have their advantages and disadvantages. For instance, LCA is known to output all frequent itemsets (recall = 1) but it also outputs many false frequent itemsets (low precision). Sampling is fast, but it outputs a large number of false itemsets as frequent itemsets, particularly when sample size is not large. Although both approaches are known to be practically useful, to the best of our knowledge there has been no comparison between the two approaches. In addition, we propose a novel sampling algorithm (DSS ). DSS selects transactions to be included in the sample based on histogram of single itemsets. An empirical comparison study between the 3 algorithms is performed using synthetic and benchmark datasets. Results show that DSS is consistently more accurate than LCA and Algo-Z, whereas LCA performs consistently better than Algo-Z. Furthermore, DSS , although requires more time than Algo-Z, is faster than LCA.