Discovery of frequent patterns in transactional data streams

Authors:
Willie Ng;Manoranjan Dash
Affiliations:
Centre for Advanced Information Systems, Nanyang Technological University, Singapore;Centre for Advanced Information Systems, Nanyang Technological University, Singapore
Venue:
Transactions on large-scale data- and knowledge-centered systems II
Year:
2010

Citing 41
Cited 1

Random sampling with a reservoir

ACM Transactions on Mathematical Software (TOMS)
A guided tour of Chernoff bounds

Information Processing Letters
Reservoir-sampling algorithms of time complexity O(n(1 + log(N/n)))

ACM Transactions on Mathematical Software (TOMS)
Efficient progressive sampling

KDD '99 Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining
Mining frequent patterns without candidate generation

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Congressional samples for approximate answering of group-by queries

SIGMOD '00 Proceedings of the 2000 ACM SIGMOD international conference on Management of data
Mining high-speed data streams

Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining
Sampling from a moving window over streaming data

SODA '02 Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms
Is Sampling Useful in Data Mining? A Case in the Maintenance of Discovered Association Rules

Data Mining and Knowledge Discovery
Fast Algorithms for Mining Association Rules in Large Databases

VLDB '94 Proceedings of the 20th International Conference on Very Large Data Bases
Sampling Large Databases for Association Rules

VLDB '96 Proceedings of the 22th International Conference on Very Large Data Bases
A simple algorithm for finding frequent elements in streams and bags

ACM Transactions on Database Systems (TODS)
Maintaining time-decaying stream aggregates

Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
A new two-phase sampling based algorithm for discovering association rules

Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
Evaluation of sampling for data mining of association rules

RIDE '97 Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97) High Performance Database Management for Large-Scale Applications
Efficient Progressive Sampling for Association Rules

ICDM '02 Proceedings of the 2002 IEEE International Conference on Data Mining
Online Data Mining for Co-Evolving Time Sequences

ICDE '00 Proceedings of the 16th International Conference on Data Engineering
Probabilistic Noise Identification and Data Cleaning

ICDM '03 Proceedings of the Third IEEE International Conference on Data Mining
Efficient data reduction with EASE

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
Finding recent frequent itemsets adaptively over online data streams

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
estWin: adaptively monitoring the recent change of frequent itemsets over online data streams

CIKM '03 Proceedings of the twelfth international conference on Information and knowledge management
Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window

ICDM '04 Proceedings of the Fourth IEEE International Conference on Data Mining
Sampling algorithms in a stream operator

Proceedings of the 2005 ACM SIGMOD international conference on Management of data
Data Mining: Concepts and Techniques

Data Mining: Concepts and Techniques
Online Mining (Recently) Maximal Frequent Itemsets over Data Streams

RIDE '05 Proceedings of the 15th International Workshop on Research Issues in Data Engineering: Stream Data Mining and Applications
Research issues in data stream association rule mining

ACM SIGMOD Record
On biased reservoir sampling in the presence of stream evolution

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Improved Association Rule Mining by Modified Trimming

CIT '06 Proceedings of the Sixth IEEE International Conference on Computer and Information Technology
Data Streams: Models and Algorithms (Advances in Database Systems)

Data Streams: Models and Algorithms (Advances in Database Systems)
An Evaluation of Progressive Sampling for Imbalanced Data Sets

ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops
Efficient Reservoir Sampling for Transactional Data Streams

ICDMW '06 Proceedings of the Sixth IEEE International Conference on Data Mining - Workshops
Frequent pattern mining: current status and future directions

Data Mining and Knowledge Discovery
Multi-dimensional regression analysis of time-series data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
Approximate frequency counts over data streams

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
StatStream: statistical monitoring of thousands of data streams in real time

VLDB '02 Proceedings of the 28th international conference on Very Large Data Bases
False positive or false negative: mining frequent itemsets from high speed transactional data streams

VLDB '04 Proceedings of the Thirtieth international conference on Very large data bases - Volume 30
A survey on algorithms for mining frequent itemsets over data streams

Knowledge and Information Systems
Efficient Approximate Mining of Frequent Patterns over Transactional Data Streams

DaWaK '08 Proceedings of the 10th international conference on Data Warehousing and Knowledge Discovery
Mining Frequent Itemsets in a Stream

ICDM '07 Proceedings of the 2007 Seventh IEEE International Conference on Data Mining
An empirical study of the noise impact on cost-sensitive learning

IJCAI'07 Proceedings of the 20th international joint conference on Artificial intelligence
Maintaining frequent itemsets over high-speed data streams

PAKDD'06 Proceedings of the 10th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining

Enhancing source selection for live queries over linked data via query log mining

JIST'11 Proceedings of the 2011 joint international conference on The Semantic Web

Abstract

A data stream is generated continuously in a dynamic environment, with huge volume, infinite flow, and fast-changing behavior. There is increasing demand for novel techniques that can discover interesting patterns in data streams while working within system resource constraints. In this paper, we survey the state-of-the-art techniques for mining frequent patterns in a continuous stream of transactions. Two prominent approaches appear in the literature: (a) perform approximate counting (e.g., the lossy counting algorithm (LCA) of Manku and Motwani, VLDB 2002) using a lower support threshold than the one given by the user, or (b) maintain a running sample (e.g., reservoir sampling (Algo-Z) of Vitter, TOMS 1985) and generate frequent patterns from the sample on demand. Although both approaches are useful in practice, to the best of our knowledge they have not been compared with each other. We also introduce a novel sampling algorithm (DSS), which selects transactions for inclusion in the sample based on a histogram of single itemsets. An empirical comparison of the three algorithms is performed using synthetic and benchmark datasets. The results show that DSS is consistently more accurate than LCA and Algo-Z, and that LCA is consistently more accurate than Algo-Z. Furthermore, although DSS requires more time than Algo-Z, it is faster than LCA.
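The two baseline approaches named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and rests on stated simplifications: the reservoir routine is the basic one-draw-per-item scheme (often called Algorithm R) rather than Vitter's faster Algo-Z, which maintains the same uniform sample but skips ahead in the stream, and the lossy counting routine counts single items rather than itemsets. The function names and the parameters `n`, `epsilon`, and `support` are illustrative choices, and DSS is not sketched because the abstract does not describe it in enough detail.

```python
import math
import random


def reservoir_sample(stream, n, rng=None):
    """Maintain a uniform random sample of up to n items from a stream.

    Assumption: this is the simple Algorithm R, not Algo-Z.
    """
    rng = rng or random.Random()
    reservoir = []
    for t, item in enumerate(stream):
        if t < n:
            # Fill the reservoir with the first n items.
            reservoir.append(item)
        else:
            # Item t (0-based) replaces a random slot with probability n/(t+1).
            j = rng.randint(0, t)  # inclusive on both ends
            if j < n:
                reservoir[j] = item
    return reservoir


def lossy_count(stream, epsilon):
    """Lossy counting for single items (simplified from the itemset version).

    Returns (entries, n_seen), where entries[item] = (count, delta) and
    delta bounds how much the stored count may undercount the true one.
    """
    width = math.ceil(1 / epsilon)  # bucket width
    entries = {}
    n_seen = 0
    for item in stream:
        n_seen += 1
        bucket = math.ceil(n_seen / width)  # id of the current bucket
        if item in entries:
            count, delta = entries[item]
            entries[item] = (count + 1, delta)
        else:
            entries[item] = (1, bucket - 1)
        if n_seen % width == 0:
            # Bucket boundary: prune entries that cannot be frequent.
            entries = {k: (c, d) for k, (c, d) in entries.items()
                       if c + d > bucket}
    return entries, n_seen


def frequent_items(entries, n_seen, support, epsilon):
    """Report items whose stored count reaches the lowered threshold."""
    threshold = (support - epsilon) * n_seen
    return {k for k, (c, _) in entries.items() if c >= threshold}
```

In use, `lossy_count` is fed the stream once and `frequent_items` is called on demand with the user's support threshold; the lowered threshold `support - epsilon` corresponds to the "lower support threshold" mentioned in the abstract. With the sampling approach, a frequent-pattern miner would instead be run on the list returned by `reservoir_sample` whenever results are requested.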