A comparison between approximate counting and sampling methods for frequent pattern mining on data streams

  • Authors:
  • Willie Ng;Manoranjan Dash

  • Affiliations:
  • (Corresponding author. E-mail: WillieNg@pmail.ntu.edu.sg) School of Computer Engineering, Nanyang Technological University, Singapore; School of Computer Engineering, Nanyang Technological University, Singapore

  • Venue:
  • Intelligent Data Analysis
  • Year:
  • 2010

Abstract

We investigate the problem of finding frequent patterns in a continuous stream of transactions. Two approaches are prominent in the literature: (a) approximate counting (e.g., the lossy counting algorithm (LCA) of Manku and Motwani, VLDB 2002), which uses a lower support threshold than the one given by the user, and (b) maintaining a running sample (e.g., reservoir sampling (Algo-Z) of Vitter, TOMS 1985) and generating frequent itemsets from the sample on demand. Although both are known to be practically useful, to the best of our knowledge there has been no comparison between them. In addition, we propose a distance-based sampling algorithm (DSS). We compare the algorithms empirically on synthetic and benchmark datasets. The results show that DSS is consistently more accurate than LCA and Algo-Z, and that LCA performs better than Algo-Z. An outcome of this study is a new algorithm, CLCA. In LCA, properly quantifying the error parameter, ε, is non-trivial; CLCA is a customized LCA that attempts to exploit this fact. Interestingly, CLCA outperforms all other algorithms (including DSS) in mining the frequent itemsets of the user's choice.
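
To make the two baseline approaches concrete, here is a minimal Python sketch, not the authors' implementation. It rests on several simplifying assumptions: it counts single items rather than itemsets, it uses the basic reservoir scheme (often called Algorithm R) in place of Vitter's faster Algo-Z, and the function names `lossy_count`, `frequent_items`, and `reservoir_sample` are hypothetical. DSS and CLCA are not shown because the abstract does not describe how they work.

```python
import math
import random


def lossy_count(stream, epsilon):
    """Lossy counting over single items (simplified from itemsets).

    Returns (entries, n), where entries maps item -> [count, delta]
    and n is the number of items seen.
    """
    width = math.ceil(1 / epsilon)  # bucket width
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # delta bounds how many occurrences may have been missed
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # At each bucket boundary, drop entries that cannot be frequent
            entries = {k: v for k, v in entries.items()
                       if v[0] + v[1] > bucket}
    return entries, n


def frequent_items(entries, n, support, epsilon):
    """Report items using the lowered threshold (support - epsilon) * n."""
    return {k for k, (count, _) in entries.items()
            if count >= (support - epsilon) * n}


def reservoir_sample(stream, k, rng):
    """Keep a uniform random sample of k items (basic Algorithm R)."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Hypothetical toy stream: "a" is 50% of items, "b" is 30%, rest unique
    stream = []
    for i in range(10):
        stream += ["a"] * 5 + ["b"] * 3 + [f"x{2 * i}", f"x{2 * i + 1}"]
    entries, n = lossy_count(stream, epsilon=0.1)
    print(frequent_items(entries, n, support=0.45, epsilon=0.1))
    print(reservoir_sample(stream, 10, random.Random(0)))
```

The counting sketch keeps only items that might still be frequent, so its memory use depends on ε rather than on the stream length. The sampling sketch instead keeps raw items and defers all counting until frequent patterns are requested.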