Analysis of sampling techniques for association rule mining

Authors:
Venkatesan T. Chakaravarthy;Vinayaka Pandit;Yogish Sabharwal
Affiliations:
IBM India Research Lab, New Delhi;IBM India Research Lab, New Delhi;IBM India Research Lab, New Delhi
Venue:
Proceedings of the 12th International Conference on Database Theory
Year:
2009


Abstract

In this paper, we present a comprehensive theoretical analysis of the sampling technique for the association rule mining problem. Most previous work has concentrated only on empirical evaluation of the effectiveness of sampling for the step of finding frequent itemsets. To the best of our knowledge, a theoretical framework to analyze the quality of the solutions obtained by sampling has not been studied. Our contributions are two-fold. First, we present the notions of ε-close frequent itemset mining and ε-close association rule mining, which help assess the quality of the solutions obtained by sampling. Second, we show that both the frequent itemset mining and association rule mining problems can be solved satisfactorily with a sample size that is independent of both the number of transactions and the number of items. Let θ be the required support, ε the closeness parameter, and 1/h the desired bound on the probability of failure. We show that the sampling-based analysis succeeds in solving both ε-close frequent itemset mining and ε-close association rule mining with probability at least (1 − 1/h) with a sample of size S = O((1/(ε²θ)) · [Δ + log(h/((1 − ε)θ))]), where Δ is the maximum number of items present in any transaction. Thus, we establish that it is possible to speed up the entire process of association rule mining for massive databases by working with a small sample while retaining any desired degree of accuracy. Our work gives a comprehensive explanation for the well-known empirical successes of sampling for association rule mining.
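
To give a feel for how the bound behaves, the following minimal Python sketch evaluates the expression inside the O(·) for concrete parameter values. It is an illustration only, not code from the paper: the function name `sample_size_bound`, the constant `c` (the abstract does not state the constant hidden by the O-notation, so it defaults to 1 here), and the use of the natural logarithm are all assumptions.

```python
import math


def sample_size_bound(theta, eps, h, delta, c=1.0):
    """Evaluate c * (1 / (eps^2 * theta)) * (delta + log(h / ((1 - eps) * theta))).

    theta : required support, in (0, 1]
    eps   : closeness parameter, in (0, 1)
    h     : failure probability is bounded by 1/h, so h > 1
    delta : maximum number of items in any transaction
    c     : hypothetical stand-in for the constant hidden by O(.)
    """
    if not 0.0 < theta <= 1.0:
        raise ValueError("theta must lie in (0, 1]")
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie in (0, 1)")
    if h <= 1:
        raise ValueError("h must exceed 1")
    if delta < 1:
        raise ValueError("delta must be at least 1")

    # Natural logarithm is an assumption; changing the base only rescales c.
    log_term = math.log(h / ((1.0 - eps) * theta))
    return math.ceil(c / (eps ** 2 * theta) * (delta + log_term))


if __name__ == "__main__":
    # Neither the number of transactions nor the number of items is an argument.
    print(sample_size_bound(theta=0.01, eps=0.1, h=100, delta=20))
```

With c = 1, θ = 0.01, ε = 0.1, h = 100 and Δ = 20, the expression evaluates to roughly 2.9 × 10^5 sampled transactions, whatever the size of the full database. The parameter h enters only through the logarithm, so tightening the failure probability is cheap, whereas halving ε roughly quadruples the value.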