Multi-scaling sampling: an adaptive sampling method for discovering approximate association rules

Authors:
Cai-Yan Jia; Xie-Ping Gao
Affiliations:
The Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, P.R. China and Graduate School of the Chinese Academy of Sciences ...; Information Engineering College, Xiangtan University, Xiangtan, P.R. China
Venue:
Journal of Computer Science and Technology
Year:
2005

Abstract

One obstacle to efficient association rule mining is the explosive growth of data sets: scanning a large database is costly or even impossible, especially when multiple passes are required. A popular way to improve the speed and scalability of association rule mining is to run the algorithm on a random sample instead of the entire database. However, how to effectively define and efficiently estimate the error that sampling introduces into the algorithm's output, and how to determine the sample size needed, remain open problems. In this paper, an effective and efficient algorithm based on PAC (Probably Approximately Correct) learning theory is given to measure and estimate sample error. Then a new adaptive, on-line, fast sampling strategy, multi-scaling sampling, inspired by MRA (Multi-Resolution Analysis) and the Shannon sampling theorem, is presented for quickly obtaining acceptably approximate association rules at an appropriate sample size. Both theoretical analysis and empirical study show that the sampling strategy achieves a very good speed-accuracy trade-off.
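
To make the sample-size question concrete, the following is a minimal sketch of a PAC-style guarantee for a single itemset. It uses the standard two-sided Hoeffding bound, which is an assumption made here for illustration and is not necessarily the error measure or bound derived in the paper. The function names `hoeffding_sample_size` and `estimate_support`, and the parameters `epsilon` (absolute error on support) and `delta` (failure probability), are hypothetical.

```python
import math
import random


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Smallest n such that 2 * exp(-2 * n * epsilon**2) <= delta.

    With n transactions drawn uniformly with replacement, the sampled
    support of one fixed itemset deviates from its true support by more
    than epsilon with probability at most delta (Hoeffding's inequality).
    Illustrative assumption: this is a textbook bound, not the paper's.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1):
        raise ValueError("epsilon and delta must lie in (0, 1)")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_support(transactions, itemset, n, rng):
    """Estimate the support of `itemset` from n transactions sampled
    uniformly with replacement from `transactions` (a list of sets)."""
    target = frozenset(itemset)
    sample = rng.choices(transactions, k=n)
    hits = sum(1 for t in sample if target <= t)
    return hits / n


if __name__ == "__main__":
    # Toy database: 30% of transactions contain both "a" and "b".
    db = [frozenset("ab")] * 300 + [frozenset("c")] * 700
    n = hoeffding_sample_size(epsilon=0.05, delta=0.01)
    rng = random.Random(0)
    print(n, estimate_support(db, {"a", "b"}, n, rng))
```

Under this bound, `epsilon = 0.01` and `delta = 0.05` give a sample of 18445 transactions regardless of database size. The bound covers one itemset at a time; covering all candidate itemsets simultaneously needs a larger sample, which is one reason adaptive strategies such as the one proposed in the paper are of interest.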